How Do You Know If Someone Is Actually Good at Using AI?
TLDR: Everyone says they use AI, but nobody can prove they’re good at it. I’m designing a platform that tests how well humans wield AI tools by capturing the full interaction process — not just the output, but how you got there. The scoring model is the moat.
The problem
Here’s a conversation I keep having. Someone says “I use AI every day” and I think — okay, but what does that mean? Are you copying and pasting into ChatGPT and accepting whatever comes back? Or are you breaking problems into pieces, choosing the right tool for each part, catching errors, and chaining steps together into something that actually works?
Those are wildly different skill levels, and right now there’s no way to tell them apart. Resumes can’t capture it. Interviews barely try. Certifications test whether you memorized facts about AI, not whether you can actually wield it.
This matters because AI literacy is becoming the baseline professional skill. Not knowing how to use AI effectively is like not knowing how to use a spreadsheet in 2005 — you can get by for a while, but the gap keeps growing. And employers, training programs, and schools all need a way to measure it.
What I’m designing
AI Arena is a platform that tests how well humans use AI to solve real problems. Not a quiz about what AI is — a hands-on assessment where you actually use AI tools under time pressure, and the system watches how you do it.
Here’s how it works: you get a problem (research brief, data analysis, content creation, coding task, strategic question). You have a time limit and access to AI tools. You solve it however you want. When you’re done, the system scores two things.
First, the output. Did you actually solve the problem? Is the work good? This part is straightforward.
Second — and this is where it gets interesting — the process. The system captures your full interaction with the AI. How many attempts did it take? Did you plan before prompting or just start typing? When the AI gave you something wrong, how fast did you notice? Did you verify the output or trust it blindly? Did you use the right tool for each sub-task?
The process score is the real insight. Two people might produce the same quality output, but one did it in 3 prompts with a clear strategy, and the other thrashed through 30 attempts. That difference is AI literacy.
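To make the process side concrete, here is a minimal sketch of what an interaction log and a few derived process metrics might look like. Everything here is hypothetical — the event fields (`tool`, `user_edited_response`), the class names, and the metrics are illustrative shapes, not a spec for the platform.

```python
from dataclasses import dataclass, field

@dataclass
class PromptEvent:
    """One turn in the human-AI interaction log. All fields are illustrative."""
    timestamp: float            # seconds since the session started
    tool: str                   # e.g. "chat", "code_agent", "search"
    prompt: str
    response: str
    user_edited_response: bool  # did they revise/check the output before using it?

@dataclass
class Session:
    events: list[PromptEvent] = field(default_factory=list)

    def attempt_count(self) -> int:
        # "3 prompts with a clear strategy" vs "30 attempts" falls out of this
        return len(self.events)

    def tools_used(self) -> set[str]:
        # one signal for tool selection: did they use anything beyond chat?
        return {e.tool for e in self.events}

    def verification_rate(self) -> float:
        """Fraction of AI responses the user edited or checked before accepting."""
        if not self.events:
            return 0.0
        return sum(e.user_edited_response for e in self.events) / len(self.events)
```

The point of the sketch is that once the full log exists, metrics like attempt count, tool diversity, and verification rate are cheap to derive — the hard part is deciding which of them actually track skill.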
Six things we’re measuring
I’ve broken AI literacy into six dimensions:
Problem decomposition. Can you look at a messy, ambiguous problem and break it into pieces that AI can actually help with? This is the strategic thinking layer — knowing what to delegate and what to do yourself.
Prompt craft. Not just “write a good prompt” but knowing when a detailed prompt matters, when a vague one is fine, and when prompting is the wrong approach entirely.
Tool selection. Chat, code agents, search, image generation — each has strengths. Skilled users pick the right tool. Unskilled users use chat for everything.
Iteration and recovery. AI gives bad output sometimes. How fast do you notice? How do you adjust? Do you re-prompt, try a different approach, or just accept garbage?
Verification and judgment. Do you check the work? Do you catch hallucinations? Do you know when to trust AI output and when to verify independently?
Orchestration. Can you chain multiple AI steps into a coherent workflow? This is the advanced skill — using AI not as a single tool but as a system.
Why this hasn’t been built before
Existing tools evaluate either the AI (Chatbot Arena) or traditional skills (LeetCode, Kaggle). Nobody evaluates the human-AI collaboration skill. The closest things are AI certifications, and those are multiple-choice knowledge tests — they tell you someone read about AI, not that they can use it.
The hard part is scoring the process. Output scoring is relatively straightforward (rubric + LLM-as-judge). But reliably scoring how someone interacted with AI — that requires capturing the full interaction log and having meaningful metrics for what “good” looks like. That scoring model is both the hardest part to build and the moat once it’s calibrated.
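For the "relatively straightforward" half, a rubric + LLM-as-judge scorer can be as simple as the sketch below. The rubric criteria and the JSON shape are invented for illustration, and `judge` is any callable that takes a prompt and returns model text — no specific model API is assumed.

```python
import json

# Hypothetical rubric; real challenges would carry task-specific criteria.
RUBRIC = """Score the submission 1-5 on each criterion:
- correctness: does it solve the stated problem?
- completeness: are all parts of the brief addressed?
- quality: is the work clear and well-organized?
Return only JSON: {"correctness": n, "completeness": n, "quality": n}"""

def score_output(task: str, submission: str, judge) -> dict[str, int]:
    """LLM-as-judge output scoring.

    `judge` is injected by the caller (prompt string in, text out), so the
    same scorer works with any model backend or a stub in tests.
    """
    prompt = f"{RUBRIC}\n\nTask:\n{task}\n\nSubmission:\n{submission}"
    return json.loads(judge(prompt))
```

Even this toy version surfaces the real reliability questions — judge variance across runs, rubric drift across challenge types — which is exactly why the process-scoring model, not the output scorer, is the hard part.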
Where this goes
For hiring, it’s a practical signal: does this candidate actually know how to leverage AI, or are they just saying they do? For training programs, it’s pre/post measurement: did this workshop actually make people better? For individuals, it’s self-assessment: where are my gaps?
The defensibility compounds over time. The challenge library grows. The scoring model gets calibrated against more data. Benchmark norms by role and industry become the reference standard. It’s a content and data moat, not a technology moat.
I’m still in the design phase — figuring out challenge formats, scoring reliability, and what makes a great assessment versus a frustrating test. The big open question is whether hiring managers will trust a score like this enough to use it. That needs validation.