Project · Feb 28, 2026

How Do You Know If Someone Is Actually Good at Using AI?

Tags: ai, education, hiring, assessment

TL;DR: Everyone says they use AI, but nobody can prove they’re good at it. I’m designing a platform that tests how well humans wield AI tools by capturing the full interaction process — not just the output, but how you got there. The scoring model is the moat.


The problem

Here’s a conversation I keep having. Someone says “I use AI every day” and I think — okay, but what does that mean? Are you copying and pasting into ChatGPT and accepting whatever comes back? Or are you breaking problems into pieces, choosing the right tool for each part, catching errors, and chaining steps together into something that actually works?

Those are wildly different skill levels, and right now there’s no way to tell them apart. Resumes can’t capture it. Interviews barely try. Certifications test whether you memorized facts about AI, not whether you can actually wield it.

This matters because AI literacy is becoming the baseline professional skill. Not knowing how to use AI effectively is like not knowing how to use a spreadsheet in 2005 — you can get by for a while, but the gap keeps growing. And employers, training programs, and schools all need a way to measure it.

What I’m designing

AI Arena is a platform that tests how well humans use AI to solve real problems. Not a quiz about what AI is — a hands-on assessment where you actually use AI tools under time pressure, and the system watches how you do it.

Here’s how it works: you get a problem (research brief, data analysis, content creation, coding task, strategic question). You have a time limit and access to AI tools. You solve it however you want. When you’re done, the system scores two things.

First, the output. Did you actually solve the problem? Is the work good? This part is straightforward.

Second — and this is where it gets interesting — the process. The system captures your full interaction with the AI. How many attempts did it take? Did you plan before prompting or just start typing? When the AI gave you something wrong, how fast did you notice? Did you verify the output or trust it blindly? Did you use the right tool for each sub-task?

The process score is the real insight. Two people might produce the same quality output, but one did it in 3 prompts with a clear strategy, and the other thrashed through 30 attempts. That difference is AI literacy.

Six things we’re measuring

I’ve broken AI literacy into six dimensions:

Problem decomposition. Can you look at a messy, ambiguous problem and break it into pieces that AI can actually help with? This is the strategic thinking layer — knowing what to delegate and what to do yourself.

Prompt craft. Not just “write a good prompt” but knowing when a detailed prompt matters, when a vague one is fine, and when prompting is the wrong approach entirely.

Tool selection. Chat, code agents, search, image generation — each has strengths. Skilled users pick the right tool. Unskilled users use chat for everything.

Iteration and recovery. AI gives bad output sometimes. How fast do you notice? How do you adjust? Do you re-prompt, try a different approach, or just accept garbage?

Verification and judgment. Do you check the work? Do you catch hallucinations? Do you know when to trust AI output and when to verify independently?

Orchestration. Can you chain multiple AI steps into a coherent workflow? This is the advanced skill — using AI not as a single tool but as a system.

Why this hasn’t been built before

Existing tools evaluate either the AI (Chatbot Arena) or traditional skills (LeetCode, Kaggle). Nobody evaluates the human-AI collaboration skill. The closest things are AI certifications, and those are multiple-choice knowledge tests — they tell you someone read about AI, not that they can use it.

The hard part is scoring the process. Output scoring is relatively straightforward (rubric + LLM-as-judge). But reliably scoring how someone interacted with AI — that requires capturing the full interaction log and having meaningful metrics for what “good” looks like. That scoring model is both the hardest part to build and the moat once it’s calibrated.

Where this goes

For hiring, it’s a practical signal: does this candidate actually know how to leverage AI, or are they just saying they do? For training programs, it’s pre/post measurement: did this workshop actually make people better? For individuals, it’s self-assessment: where are my gaps?

The defensibility compounds over time. The challenge library grows. The scoring model gets calibrated against more data. Benchmark norms by role and industry become the reference standard. It’s a content and data moat, not a technology moat.

I’m still in the design phase — figuring out challenge formats, scoring reliability, and what makes a great assessment versus a frustrating test. The big open question is whether hiring managers will trust a score like this enough to use it. That needs validation.

The problem, specifically

There’s no standardized way to measure how effectively someone uses AI tools. Certifications test knowledge recall (multiple choice about transformer architecture). Interviews ask “do you use AI?” which is useless signal. The gap between someone who copies and pastes into ChatGPT and someone who orchestrates multi-step agent workflows is enormous, and invisible to current evaluation methods.

This matters for three use cases: hiring (signal for AI-native candidates), training (pre/post measurement for upskilling programs), and education (certifiable AI literacy for students).

Architecture

Challenge Engine
  - Challenge library (categorized, difficulty-tiered, with variants)
  - Time-boxed sessions with standardized AI tooling
  - Full interaction capture: prompts, responses, tool switches, timing, edits

Scoring Engine
  - Output scoring: rubric-based, LLM-as-judge (partially automated) + human review
  - Process scoring: interaction log analysis across 6 dimensions
  - Composite AI Literacy Score: weighted output + process, percentile-normalized

Assessment API
  - Embed challenges in hiring pipelines (ATS integration)
  - Role-specific challenge packs
  - Candidate reports with dimension breakdown

Scoring model design

This is the core technical challenge. Two layers:

Output scoring is the easier problem. Each challenge has a rubric with objective completion criteria. LLM-as-judge evaluates against the rubric. For high-stakes assessments, human review validates. Target inter-rater agreement >= 0.85 (human vs. automated).
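The spec targets agreement >= 0.85 but doesn’t name a metric; Cohen’s kappa (raw agreement corrected for chance) is one common choice for rubric-style grades. A minimal sketch:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over categorical grades."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Expected agreement if both raters graded independently at their base rates
    expected = sum(counts_a[l] * counts_b[l] for l in set(counts_a) | set(counts_b)) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

# Human grades vs. LLM-as-judge grades on the same rubric criteria (made-up data)
human = ["pass", "pass", "fail", "pass", "fail", "pass"]
judge = ["pass", "pass", "fail", "pass", "pass", "pass"]
print(round(cohens_kappa(human, judge), 3))  # 0.571
```

Note that kappa is stricter than raw percent agreement (5/6 here), which is exactly why it’s a better calibration target.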

Process scoring is the hard part and the moat. From the interaction log, we derive:

  • Problem decomposition: Did they plan before prompting? Distinct sub-tasks identifiable in the prompt sequence?
  • Prompt craft: Specificity, token efficiency, appropriate detail level per sub-task.
  • Tool selection: Did they switch tools appropriately? Chat for ideation, code agent for implementation, search for facts?
  • Iteration & recovery: Prompts after bad output — did they re-approach or repeat? Time-to-correction. Dead-end detection.
  • Verification: Evidence of checking output — follow-up validation prompts, manual edits correcting AI errors.
  • Orchestration: Multi-step workflows, output of step N feeding step N+1, coherent strategy across interactions.
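Signals like these can be derived mechanically from the event stream. A sketch under an assumed minimal event shape (timestamp, event type, tool, and a "bad output" flag that would itself come from a grader, not the log):

```python
# Hypothetical event shape: {"t": seconds, "type": "prompt"|"response", "tool": str, "bad": bool}
SESSION = [
    {"t": 0,  "type": "prompt",   "tool": "chat",   "bad": False},
    {"t": 8,  "type": "response", "tool": "chat",   "bad": True},   # hallucinated answer
    {"t": 20, "type": "prompt",   "tool": "search", "bad": False},  # candidate pivots to verify
    {"t": 25, "type": "response", "tool": "search", "bad": False},
    {"t": 40, "type": "prompt",   "tool": "code",   "bad": False},
    {"t": 70, "type": "response", "tool": "code",   "bad": False},
]

def tool_switch_count(events):
    """Tool-selection signal: how often consecutive prompts change tools."""
    tools = [e["tool"] for e in events if e["type"] == "prompt"]
    return sum(a != b for a, b in zip(tools, tools[1:]))

def mean_time_to_correction(events):
    """Iteration signal: seconds between a bad response and the next prompt."""
    gaps = [nxt["t"] - prev["t"]
            for prev, nxt in zip(events, events[1:])
            if prev["type"] == "response" and prev["bad"] and nxt["type"] == "prompt"]
    return sum(gaps) / len(gaps) if gaps else None

print(tool_switch_count(SESSION))        # 2 (chat -> search, search -> code)
print(mean_time_to_correction(SESSION))  # 12.0
```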

Each dimension scored 0-100. The composite score is a weighted combination, percentile-normalized against the population.
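The weighted combination itself is simple; the hard part is choosing and calibrating the weights. A sketch (the dimension weights and the 50/50 output/process split below are illustrative assumptions, not spec):

```python
# Illustrative weights only -- the spec fixes neither the per-dimension
# weights nor the output/process split.
PROCESS_WEIGHTS = {
    "decomposition": 0.20, "prompt_craft": 0.15, "tool_selection": 0.15,
    "iteration": 0.15, "verification": 0.20, "orchestration": 0.15,
}

def composite_score(dimensions, output_score, output_weight=0.5):
    """Blend output quality and the six process dimensions into one 0-100 score."""
    assert abs(sum(PROCESS_WEIGHTS.values()) - 1.0) < 1e-9
    process = sum(PROCESS_WEIGHTS[d] * s for d, s in dimensions.items())
    return output_weight * output_score + (1 - output_weight) * process

dims = {d: 80 for d in PROCESS_WEIGHTS}  # flat 80 across all six dimensions
print(round(composite_score(dims, output_score=90), 1))  # 85.0
```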

Challenge design constraints

Challenges need to satisfy competing requirements:

  • Objective completion criteria (scoreable) but multiple valid approaches (don’t penalize creativity)
  • Process-revealing — challenge structure must force observable AI interaction, not just a final deliverable
  • Variant generation — same skill test, different surface problem, to prevent memorization
  • Model-agnostic — a skilled user should score well regardless of which frontier model is available
  • Difficulty tiered — Foundational / Intermediate / Advanced / Expert with clear discrimination at each level

Categories: Research & Synthesis, Data Analysis, Content & Communication, Code & Automation, Strategic Reasoning, Multimodal.

Anti-gaming

  • Process scoring penalizes patterns suggesting gaming: copy-pasting known solutions without reading AI output, mechanical prompt repetition
  • Challenge variants prevent memorization
  • Time-based anomaly detection for suspiciously fast completions
  • Challenges designed so brute-force prompting produces measurably worse process scores than strategic use
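The time-based check can start as plain z-score filtering over completion times; a production version would likely prefer robust statistics (median/MAD) since outliers distort the mean. A sketch:

```python
from statistics import mean, stdev

def flag_fast_completions(times, z_cutoff=-2.0):
    """Flag completion times suspiciously far below the population mean.
    z-score sketch; robust stats (median/MAD) would resist outlier distortion."""
    mu, sigma = mean(times), stdev(times)
    return [t for t in times if (t - mu) / sigma < z_cutoff]

# Minutes to complete one challenge variant (made-up data)
print(flag_fast_completions([30, 32, 28, 31, 29, 30, 5]))  # [5]
```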

Standardized tooling (v1) vs. BYO (future)

V1 provides a controlled environment: chat AI (frontier model), code sandbox, web search, file upload. All interactions captured natively.

BYO is the future (tests real-world setup) but creates a scoring nightmare — screen recording + interaction capture across arbitrary tools. Deferred.

Key technical decisions

LLM-as-judge for output scoring. Using frontier models to evaluate challenge outputs against structured rubrics. The risk is hallucinated scores — mitigation is calibration against human-graded samples and flagging low-confidence evaluations for human review.
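A sketch of the rubric-to-prompt and reply-parsing plumbing around the judge, with the model call stubbed out. The rubric contents, the JSON reply shape, and the 0.7 confidence threshold are all assumptions for illustration:

```python
import json

RUBRIC = [  # hypothetical rubric for a research-summary challenge
    "Answers the brief's core question",
    "Claims are supported by cited sources",
    "No fabricated facts",
]

def build_judge_prompt(rubric, submission):
    """Assemble the structured instruction a frontier model would receive."""
    criteria = "\n".join(f"{i}. {c}" for i, c in enumerate(rubric, start=1))
    return ("Score the submission against each criterion (0-10). "
            'Reply with JSON: {"scores": [...], "confidence": 0-1}.\n\n'
            f"Criteria:\n{criteria}\n\nSubmission:\n{submission}")

def parse_judge_reply(reply, review_threshold=0.7):
    """Normalize scores to 0-100; flag low-confidence evaluations for human review."""
    data = json.loads(reply)
    needs_human_review = data["confidence"] < review_threshold
    score = 100 * sum(data["scores"]) / (10 * len(data["scores"]))
    return score, needs_human_review

# Stubbed judge reply -- a real system would call a frontier model here
score, review = parse_judge_reply('{"scores": [9, 7, 10], "confidence": 0.62}')
print(round(score, 1), review)  # 86.7 True
```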

Interaction log as first-class data. Every prompt, response, tool switch, edit, and timing event is captured as structured data. This is the raw material for process scoring. Schema needs to be rich enough to reconstruct the candidate’s full decision-making sequence.
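One possible shape for such a schema (field names and event kinds are hypothetical, a sketch of the minimum needed to replay a session):

```python
from dataclasses import dataclass, asdict
from typing import Literal

@dataclass
class InteractionEvent:
    """One event in the session log -- a hypothetical schema sketch."""
    t_ms: int    # milliseconds since session start
    kind: Literal["prompt", "response", "tool_switch", "edit", "submit"]
    tool: str    # e.g. "chat", "code", "search", "upload"
    content: str = ""
    tokens: int = 0  # token count, feeds efficiency metrics

log = [
    InteractionEvent(0, "prompt", "chat", "Outline the research brief", 12),
    InteractionEvent(9500, "response", "chat", "1. Market size 2. Competitors", 180),
    InteractionEvent(14000, "tool_switch", "search"),
]
print(asdict(log[0])["kind"])  # prompt
```

Storing events as flat, timestamped records like this keeps the process-scoring layer decoupled: each dimension’s signal extractor is just a function over the list.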

Percentile normalization. Raw dimension scores are meaningless without context. Normalization against the growing population makes scores interpretable (“this candidate is in the 85th percentile for prompt craft among PM candidates”).
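The normalization step itself is trivial once a comparison population exists; the moat is the population data, not the math. A sketch with a made-up cohort:

```python
def percentile_of(score, population):
    """Percent of the comparison population scoring at or below `score`."""
    return 100.0 * sum(s <= score for s in population) / len(population)

# Made-up prompt-craft scores for a PM-candidate cohort (illustrative only)
pm_cohort = [55, 60, 62, 70, 71, 74, 78, 80, 85, 91]
print(percentile_of(78, pm_cohort))  # 70.0
```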

What’s not built yet

This is at the spec stage. Key unknowns:

  • Scoring reliability. Can process scoring be consistent enough for hiring decisions? Needs a validation study: score N people, track real-world AI effectiveness, measure correlation.
  • Challenge shelf life. AI capabilities change fast. Challenges must test transferable meta-skills (decomposition, verification) not tool-specific tricks (prompt templates for GPT-4).
  • LLM-as-judge accuracy. What’s the human-agreement rate on output scoring? Needs calibration data.
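The validation study in the first bullet reduces to a rank correlation between Arena scores and some real-world effectiveness measure; Spearman’s rho is a natural choice since neither scale is interval-calibrated. A sketch with made-up data (no tie handling):

```python
def ranks(xs):
    """Rank positions 1..n (ties not handled in this sketch)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Rank correlation: does Arena score order match real-world order?"""
    return pearson(ranks(x), ranks(y))

arena_scores = [40, 55, 62, 70, 88]          # made-up assessment scores
manager_ratings = [2.0, 3.0, 3.5, 4.0, 5.0]  # made-up effectiveness ratings
print(round(spearman(arena_scores, manager_ratings), 2))  # 1.0
```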

The moat

The defensibility is in compounding data, not technology:

  1. Challenge library with validated rubrics (expensive to create and validate)
  2. Scoring calibration data (what interaction patterns correlate with real-world AI effectiveness?)
  3. Benchmark norms by role/industry/experience (the reference standard)
  4. Employer network effects (if employers adopt, candidates must use, generating more data)

[Illustration: Two kids building with Legos, one organized, one chaotic. Same goal, different process.]

A question for you

You know how in school, there are tests for math, reading, and science? You take the test, and it tells you how well you understand the subject.

But what about using AI? Lots of people say “I use AI all the time.” But some of them are really, really good at it, and some of them are… not. And right now, there’s no test for that. No way to tell who actually knows what they’re doing.

Think about it like this. Two people sit down to build a Lego set. They both finish it. But one person read the instructions, planned which pieces they needed, and built it step by step. The other person just grabbed random pieces, tried to force them together, had to tear it apart three times, and eventually got there by accident. They both finished — but they didn’t build the same way.

Using AI is the same. The result might look similar, but the process is completely different.

What if we could see the process?

That’s what I’m designing. It’s called AI Arena, and it works like this:

You get a real problem to solve — not a quiz with right or wrong answers, but something messy and interesting. Maybe you need to research a topic and write a summary. Maybe you need to analyze some data and find patterns. Maybe you need to build something.

You have a time limit, and you have AI tools to help you. How you use them is up to you.

Here’s the twist: the system watches how you work. Not just what you produce at the end, but everything in between. Did you think about the problem before diving in? When the AI gave you something wrong, did you notice? Did you try a different approach, or did you just keep asking the same question over and over?

[Illustration: Six colorful skill badges arranged like superhero shields.]

Six superpowers

I figured out that being good at working with AI actually comes down to six skills. Think of them like superpowers:

Breaking things down. Can you look at a big confusing problem and split it into smaller, manageable pieces? This is like knowing which parts of a recipe to do first.

Asking good questions. The way you ask an AI to do something really matters. It’s like giving directions to someone — “go somewhere fun” is way less helpful than “take me to the park on 5th Street.”

Picking the right tool. AI can do lots of different things — answer questions, write code, search the internet, create images. Knowing which tool fits which job is a skill.

Fixing mistakes. AI gets things wrong sometimes. Can you tell when it’s wrong? What do you do next? Do you give up, or do you try a different way?

Checking the work. This is a big one. Do you just believe everything the AI tells you? Or do you double-check the important stuff? Smart AI users always verify.

Putting it all together. The most skilled people can make AI do several things in a row, where each step builds on the last one. Like a chain of dominoes — each one sets up the next.

Why this matters for you

Here’s something to think about: by the time you’re looking for your first job, knowing how to work with AI won’t be a bonus — it’ll be expected. Like knowing how to type or use a computer.

The cool thing is, these six skills aren’t just about AI. Breaking problems down, asking good questions, checking your work, fixing mistakes — those are thinking skills that help with everything.

So here’s a challenge for you right now: next time you use AI for homework or a project, pay attention to your own process. Are you just typing and hoping? Or are you thinking about what you need, asking clearly, and checking what comes back?

You might be better at this than you think. Or you might discover a superpower you haven’t practiced yet.