Your team uses AI for code generation, refactoring, test writing, API design. Someone tries Claude, someone else tries the latest OpenAI model. One engineer says "it feels better," another disagrees. A few weeks later, the vendor pushes an update and suddenly the model is slower or produces lower quality code. You don't know when it happened. You just know your workflows feel different.
This is the state of most teams: no benchmark, no baseline, no way to know if a model is actually good for your work or if you're just getting lucky.
There's another problem: vendor benchmarks complicate the picture. OpenAI announces improvements on MMLU. Anthropic claims better performance on GSM8K. But in your shop, the newer model costs more and feels slower—or fails on problems the old one solved. Either the benchmarks measure something your team doesn't care about, or the numbers can't be trusted.
Here's why public benchmarks fail. MMLU, HumanEval, and GSM8K are published on arXiv, discussed on Reddit, and embedded in blog posts and training data discussions. Models trained on internet-scale data have almost certainly seen these benchmarks during training. Research on benchmark data contamination shows that when test questions appear in training data, whether deliberately or accidentally, models can memorize the answers rather than learn the underlying concept. When a vendor reports "86.8% on MMLU," you have no way to know whether the model learned the concept or memorized the test.
This creates a perverse incentive. Once benchmarks become public targets, vendors optimize for them. They can privately test dozens of model versions and publish only the best results, or cherry-pick favorable conditions. The score improves. Real capability doesn't.
"Headline scores often measure how well a model gamed the test harness rather than how well it solved the underlying tasks."
Analysis of the Chatbot Arena leaderboard found that providers could inflate reported scores through selective disclosure of privately tested variants, and that access to Arena data could yield relative gains of up to 112% on the Arena distribution. The same study identified 27 private LLM variants that Meta tested before the Llama-4 release, with only the best results published. Kili Technology's research confirmed that organizations overestimate their models' real performance by 30% or more when they optimize for published benchmarks without measuring actual work.
There's no need to speculate: frontier model vendors are in a race, burning billions of dollars to lead the AI space. The incentives are obvious. When you see their self-published benchmark scores improve, would you blindly trust them? Or would you want to verify the improvements on your own work?
The answer is simple: create a test from your actual work, run it on different models multiple times, and compare the averages. Then use your judgment about how much confidence you need before deciding. This is what OpenAI, Anthropic, and every serious team do: they run benchmarks on their actual problems, report the percentage, and move on. You should too, and you can use real problems from your codebase to do it.
What Is a Benchmark? (It's Just a Score)
A benchmark is simple: you have a set of problems, you run models on them multiple times using independent sessions, count how many the model solves each time, and average the results. That's your score.
Example:
- You have 50 refactoring tasks from your codebase
- You run Model A twice (two independent sessions) on all 50 problems
- Run 1: 42 pass your tests → 84%
- Run 2: 41 pass your tests → 82%
- Average score: 83%
Compare across models, for example:
- Model A: 83% (avg of 2 runs)
- Model B: 78% (avg of 2 runs)
Model A is better. But verify with more runs if this is a significant decision (see "Handling Variance" below).
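The arithmetic above can be sketched in a few lines of Python. The function names `pass_rate` and `average_score` are illustrative and not part of any library.

```python
from statistics import mean


def pass_rate(passed: int, total: int) -> float:
    """Percentage of problems solved in a single run."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * passed / total


def average_score(passed_per_run: list[int], total: int) -> float:
    """Average the per-run pass rates for one model."""
    return mean(pass_rate(passed, total) for passed in passed_per_run)


# The example from the text: 50 problems, two independent runs of Model A.
model_a = average_score([42, 41], total=50)
print(f"Model A: {model_a:.1f}%")
```

Averaging per-run percentages only makes sense when every run covers the same problem set, which is the case here.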
For example, Anthropic publishes Claude Opus 4.7 at 92.8% on MMLU, and OpenAI reports GPT-5.5 at 92.4% on MMLU. Both vendors run benchmarks, report the percentage, and use the score to guide decisions. Your internal benchmark follows the same methodology, but applied to your actual problems rather than public test sets. (These figures are examples only; your benchmark will measure real work unique to your codebase.)
The benchmark doesn't have to be fancy. It just has to be:
- Real problems (from your codebase, not made up)
- Measurable (you have a way to check pass/fail)
- Verifiable (you can run it again and get results consistent enough to be useful)
How to Build a Benchmark: 4 Steps
Now that you know what a benchmark is and why you need your own, here's how to build one:
- Find real problems you've already solved
- Write programmatic checks
- Run models and score
- Build your benchmark
Step 1: Find Real Problems You've Already Solved
Look at recent work your team completed:
- A refactoring you shipped last month
- A feature you built
- An API migration
- A legacy code cleanup
Pick one. Get the before code and the after code (your solution).
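If the work landed as a single commit, one way to pull the before and after versions of a file is `git show <rev>:<path>`. The sketch below assumes git is on your PATH and that the change has a single parent commit; the helper names are hypothetical.

```python
import subprocess


def before_after_revs(commit: str, path: str) -> tuple[str, str]:
    """Build git revision specs for a file just before and at a commit."""
    return f"{commit}^:{path}", f"{commit}:{path}"


def read_at_rev(rev_spec: str, repo_dir: str = ".") -> str:
    """Return the file contents at a revision using `git show <rev>:<path>`."""
    result = subprocess.run(
        ["git", "show", rev_spec],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,  # raise if git reports an error (bad commit or path)
    )
    return result.stdout


# Hypothetical usage inside a repository:
# before_spec, after_spec = before_after_revs("abc123", "src/payments.py")
# before_code = read_at_rev(before_spec)
# after_code = read_at_rev(after_spec)
```

For work spread across several commits, you would compare the commit before the first change with the last commit instead.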
Step 2: Write Programmatic Checks
For the "before" code, define what "correct" means. Examples:
API design task: You asked the model to add a new endpoint to an API.
- Check: Does the schema include all required fields? (programmatic validation)
- Check: Does it follow your naming conventions? (regex or linter)
- Check: Do the resolvers compile and return the right type? (type checking or runtime test)
- Pass if: All checks pass. Fail if: Any check fails.
Refactoring task: You asked the model to refactor a payment module.
- Check: Does the refactored code pass the existing test suite? (run tests)
- Check: Does it pass the linter? (run linter)
- Check: Are the function signatures compatible with call sites? (check imports/exports)
- Pass if: All three checks pass. Fail if: Any check fails.
Feature implementation: You asked the model to implement a feature.
- Check: Does it pass your test suite? (run tests)
- Check: Does it match the spec you provided? (automated checks derived from the spec, such as acceptance tests)
- Pass if: Tests pass + spec checks pass. Fail if: Tests fail or any spec check fails.
The key: make the check programmatic. Tests, linters, type checkers, schema validators. No subjective judgment. Binary: pass or fail.
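One way to keep checks binary is to treat each one as a command that either exits with status 0 or does not. The following is a minimal sketch under that assumption; the `pytest` and `ruff` commands are placeholders for whatever test runner and linter your project actually uses.

```python
import subprocess
import sys


def run_check(command: list[str], cwd: str = ".", timeout: int = 600) -> bool:
    """A check passes only if the command exits with status 0."""
    try:
        result = subprocess.run(
            command, cwd=cwd, capture_output=True, timeout=timeout
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False  # a hung or missing tool counts as a failure
    return result.returncode == 0


def problem_passes(checks: dict[str, list[str]], cwd: str = ".") -> bool:
    """Binary result for one problem: every check must pass."""
    return all(run_check(command, cwd) for command in checks.values())


# Placeholder commands (assumptions): swap in your own tools.
example_checks = {
    "tests": [sys.executable, "-m", "pytest", "-q"],
    "lint": [sys.executable, "-m", "ruff", "check", "."],
}
# passed = problem_passes(example_checks, cwd="path/to/model/output")
```

Treating timeouts as failures keeps a hung test suite from stalling the whole benchmark.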
Step 3: Run Models and Score
Give both the "before" code and the problem description to your models. Run each problem at least twice using independent sessions (different agent sessions, not reused). Here's what this looks like:
Model A:
- Run 1: 42/50 problems pass → 84%
- Run 2: 41/50 problems pass → 82%
- Average: 83%
Model B:
- Run 1: 35/50 problems pass → 70%
- Run 2: 36/50 problems pass → 72%
- Average: 71%
Model A is better. For prototype decisions with clear differences like this (83% vs 71%), two independent runs are your minimum. For significant decisions (vendor selection, major spending), see "Handling Variance" below for guidance on when to run more times.
Step 4: Build Your Benchmark
Once you have one working test, add more:
- Extract 5–10 more real problems your team has solved, and keep adding until you reach a useful sample size
- Write programmatic checks for each
- Create a script that runs all models on all problems and scores them
- Run it monthly, quarterly, or whenever you need to compare models
Sample size: 50+ problems is solid. 20 is workable to start. 300+ is excellent if you have the time.
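The scoring script from the list above might look roughly like this. It is a sketch, not a definitive harness: `ModelFn` and `Checker` are hypothetical signatures, and each call to a model function is assumed to open a fresh, independent session with your provider.

```python
from statistics import mean
from typing import Callable

# Hypothetical signatures: a model takes a prompt and returns a candidate
# solution; a checker returns True if that solution passes your checks.
ModelFn = Callable[[str], str]
Checker = Callable[[str], bool]
Problems = dict[str, tuple[str, Checker]]


def run_once(model: ModelFn, problems: Problems) -> float:
    """One full pass over the problem set; returns the pass rate in percent."""
    passed = sum(1 for prompt, check in problems.values() if check(model(prompt)))
    return 100.0 * passed / len(problems)


def run_benchmark(
    models: dict[str, ModelFn], problems: Problems, runs: int = 2
) -> dict[str, float]:
    """Average pass rate per model, using the same number of runs for each."""
    return {
        name: mean(run_once(model, problems) for _ in range(runs))
        for name, model in models.items()
    }


# Stub models and problems so the sketch runs without any API access.
problems: Problems = {
    "p1": ("task one", lambda solution: solution == "ok"),
    "p2": ("task two", lambda solution: solution == "ok"),
}
models: dict[str, ModelFn] = {
    "model_a": lambda prompt: "ok",
    "model_b": lambda prompt: "ok" if prompt == "task one" else "wrong",
}
print(run_benchmark(models, problems))
```

In a real setup, the stub lambdas would be replaced by functions that call your provider's client and by checkers that run your test suite against the returned code.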
What Counts as a "Pass"?
Whatever you define. Examples:
Code generation: Passes your test suite. Code compiles. Follows style guide.
Refactoring: Existing tests pass. New functionality works. No breaking changes.
API design: Schema validates. Resolvers compile. Matches spec.
Feature implementation: Unit tests pass. Integration tests pass. Matches requirements.
Database migration: Runs without errors. Data integrity checks pass. Performance is acceptable.
Don't overthink it. Make it automated. Binary: works or doesn't work.
Real Examples
Refactoring Benchmark
You refactored a legacy payment module. Test it:
- Extract the before/after code from your repo
- Give the model the before code + problem statement ("Refactor to support new fee structures")
- Check: Does the refactored code pass existing tests? (test suite)
- Check: Does it pass linting? (linter)
- Check: Does it handle the new fee structure? (unit test)
- Pass: All checks pass
- Fail: Any check fails
API Design Benchmark
You migrated a REST API to GraphQL. Test it:
- Give the model the old REST schema + migration spec
- Check: Does the GraphQL schema validate? (GraphQL validator)
- Check: Are all required fields present? (schema inspection)
- Check: Do the resolvers return the right types? (type checking)
- Pass: Schema valid + all fields present + types correct
- Fail: Schema invalid or missing fields or type errors
Feature Implementation Benchmark
You built a feature. Test it:
- Give the model the requirements + existing codebase
- Check: Does the code pass your test suite? (test runner)
- Check: Does it compile? (compiler or type checker)
- Check: Does it follow conventions? (linter for style)
- Pass: Tests pass + compiles + style clean
- Fail: Tests fail or doesn't compile or style violations
That's the Core Benchmark
You now have an objective way to score models. Compare them. Share the score with your team. Switch models based on data, not gut feel.
This is 95% of what you need. OpenAI and Anthropic do exactly this—report a percentage and move on.
Handling Variance: When to Run Multiple Times
Models can produce slightly different outputs between runs. You must run multiple times to get reliable results.
The baseline: Run twice, minimum. Use independent sessions for each run (different agent sessions, not the same one repeated). If you reuse the same session, you introduce state bias from token history and model memory. Results from reused sessions are invalid and should be discarded. Average the two scores.
Check your variance:
- Is the variance small? (same score both times, or within 1–2 percentage points)
- Is the difference between models large? (e.g., 85% vs 70%)
If yes to both AND you're making a low-stakes decision (prototype, internal tooling), two runs might be sufficient. If no—if variance is large, models are close, or this is a significant decision (vendor selection, major spending)—run more trials.
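The decision rule above can be written down so it is applied the same way every time. In this sketch the 2-point spread limit follows the text, while the 10-point minimum gap between models is an assumed threshold you should tune to your own risk tolerance.

```python
from statistics import mean


def needs_more_runs(
    scores_a: list[float],
    scores_b: list[float],
    max_spread: float = 2.0,   # from the text: within 1-2 percentage points
    min_gap: float = 10.0,     # assumption: what counts as a "large" difference
    high_stakes: bool = False,
) -> bool:
    """Return True if two runs per model are not enough to decide."""
    spread = max(
        max(scores_a) - min(scores_a),
        max(scores_b) - min(scores_b),
    )
    gap = abs(mean(scores_a) - mean(scores_b))
    low_variance = spread <= max_spread
    clear_winner = gap >= min_gap
    return high_stakes or not (low_variance and clear_winner)


# Scores from the Step 3 example: stable runs and a clear gap.
print(needs_more_runs([84.0, 82.0], [70.0, 72.0]))
```

Any high-stakes decision returns True regardless of the scores, matching the guidance that vendor selection and major spending always warrant more trials.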
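Anthropic recommends running evaluations multiple times: in their Claude 3 evaluations, they averaged results over 5 trials per problem. HumanEval, the standard code generation benchmark, reports pass@k (commonly pass@1, pass@10, and pass@100), the probability that at least one of k sampled solutions passes, so it also relies on many samples per problem.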
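For reference, the pass@k numbers mentioned above are usually computed with the unbiased estimator from the HumanEval paper (Chen et al., 2021): generate n samples per problem, count the c that pass, and compute 1 − C(n−c, k)/C(n, k). A minimal sketch for a single problem:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes.

    n: samples generated for the problem
    c: samples that passed the checks
    k: sample budget being evaluated (k <= n)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few failures to fill k draws with failures only
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 samples, 5 of them passing: pass@1 is simply the raw pass rate.
print(pass_at_k(n=10, c=5, k=1))
```

The benchmark-level score is the average of this value across all problems. For most internal comparisons, the simple averaged pass rate is enough and pass@k is optional.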
How many runs beyond two? That depends on your risk tolerance and budget. The tradeoff is real: more runs = higher confidence, but also more tokens and more time. Context matters:
- Prototype/internal: Two runs (with independent sessions) is your minimum
- Department-level decision: 5 runs and compare variance
- $50K+/month vendor decision: 10+ runs, report confidence intervals
Two non-negotiable rules:
- Independent sessions only. Reusing the same session introduces state bias. If you violate this, discard those results.
- Same number of trials for each model. Comparing Model A with 10 runs against Model B with 3 runs is invalid and misleading.
Should You Only Test New Models?
No. Once you have your benchmark, keep it. Models change over time—sometimes they improve, sometimes they degrade. And vendors don't always announce changes.
A real example: In early 2026, Claude Opus 4.6 changed without notice. Analysis of 6,852 real usage sessions detected median reasoning depth dropped 73% between January and March. Anthropic published a postmortem explaining three infrastructure bugs: context window routing errors (16% of requests), TPU misconfigurations, and compiler bugs in token probability calculations. Teams discovered the change because their work started failing differently—not because they were notified.
If your team uses a model in production, run your benchmark every quarter (or whenever workflows feel different). If your score changes significantly, you know something shifted. You have data instead of guesswork.
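A recurring run only helps if something flags the change. The sketch below compares the latest score against the mean of earlier scores; the 5-point threshold is an assumption, and `detect_shift` is a hypothetical helper name.

```python
from statistics import mean
from typing import Optional


def detect_shift(
    history: list[float], current: float, threshold: float = 5.0
) -> Optional[float]:
    """Return the change in percentage points if it meets the threshold.

    Returns None when there is no baseline yet or the change is small.
    """
    if not history:
        return None
    change = current - mean(history)
    return change if abs(change) >= threshold else None


# Three earlier quarterly scores, then a noticeably lower one.
shift = detect_shift([83.0, 84.0, 82.0], current=74.0)
if shift is not None:
    print(f"Score moved by {shift:+.1f} points; investigate before trusting the model.")
```

Set the threshold above your normal run-to-run variance so that ordinary noise does not trigger an alert.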
This is where benchmarks save time: one caught change prevents weeks of "why is this different now?" investigation.
Advanced: If You Want to Go Deeper
There are optional refinements if you want more statistical rigor. Most teams don't need these, but they exist.
If You Want Statistical Rigor (For Published Results)
The baseline two-run approach works for most internal decisions ("which model should we use?"). But if you're publishing your results or want confidence intervals around your score, Anthropic recommends 5–10 runs per problem. Average the results and report the standard error of the mean (SEM) alongside your score.
- When to use it: You're publishing benchmark results and want to claim statistical significance.
- Cost: 2.5–5× the API calls of the two-run baseline.
- Skip this if: You're comparing models internally and your decision has low stakes.
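Computing the mean and SEM from per-run scores takes only the standard library. This is a sketch with made-up scores; the SEM here is the sample standard deviation divided by the square root of the number of runs.

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_sem(scores: list[float]) -> tuple[float, float]:
    """Return (mean, standard error of the mean) for per-run scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate SEM")
    return mean(scores), stdev(scores) / sqrt(len(scores))


# Illustrative scores from five independent runs of one model.
runs = [84.0, 82.0, 86.0, 80.0, 83.0]
avg, sem = mean_and_sem(runs)
print(f"{avg:.1f}% ± {sem:.1f} (SEM, n={len(runs)})")
```

With only 5 to 10 runs, a confidence interval built from the SEM should use a t-distribution multiplier rather than 1.96, so treat a plain "mean ± 2 × SEM" range as a rough guide.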
If You're Measuring the Impact of Major Refactors
Say you're debating TypeScript strict mode. You could run your benchmark before and after, then see: "With strict mode: 85%. Without strict mode: 78%." This tells you the impact of the refactor on AI performance.
- When to use it: You're investing in a major technology change and want to quantify the AI payoff.
- Cost: Setup effort (before/after benchmarks, controlled variables).
- Skip this if: You're just trying to pick a model. This is for tech strategy decisions.
If You're Relying on Published Research
You read that "TypeScript improves code quality 15%." Before implementing TypeScript company-wide, verify it applies to your scenario. The researchers might have tested "strict TypeScript + ESLint + Prettier," while you're considering "loose TypeScript only." Their 15% might not apply. Run your own benchmark before committing.
- When to use it: You're making a large organizational change and external research drove the decision.
- Cost: Build a before/after benchmark.
- Skip this if: You're just picking a model or comparing minor changes.
Keep Your Benchmark Private
One last thing: don't publish the test cases themselves. Publish the score (85% vs 72%), publish your methodology (how you tested), but keep the actual problems private.
Why? Because once the test cases are public, they can end up in training data. As research shows, public benchmarks become contaminated quickly.
"High scores on static benchmarks can be deeply misleading, and inflated leaderboard scores can be uncorrelated with the underlying capabilities."
Your private benchmark is your competitive advantage—it measures real capability without the noise.
Publishing your results is great. Sharing insights is great. Publishing the test cases is how you kill your benchmark.
Why This Matters
You're spending tens of thousands of dollars monthly on AI. Models change without notice—sometimes they improve, sometimes they degrade. Vendors don't always announce what shifted. Your team wastes time on a slower or lower-quality model because nobody measured.
A simple benchmark—50 real problems, scored on pass/fail—changes that. It takes a few hours to set up. It pays for itself the first month you catch a change or switch to a cheaper model that actually works.
Do it. Your bottom line will thank you.
