

How to Build Your Own AI Benchmark (And Why It's Critical)

Public benchmarks don't tell you if models work for your codebase. Build a simple scoring system from real problems: extract solved code, write programmatic checks, test models, get a percentage score. This is what OpenAI and Anthropic do.

By Nicolas Bouvrette · 13 min read

Your team uses AI for code generation, refactoring, test writing, API design. Someone tries Claude, someone else tries the latest OpenAI model. One engineer says "it feels better," another disagrees. A few weeks later, the vendor pushes an update and suddenly the model is slower or produces lower quality code. You don't know when it happened. You just know your workflows feel different.

This is the state of most teams: no benchmark, no baseline, no way to know if a model is actually good for your work or if you're just getting lucky.

There's another problem: vendor benchmarks complicate the picture. OpenAI announces improvements on MMLU. Anthropic claims better performance on GSM8K. But in your shop, the newer model costs more and feels slower—or fails on problems the old one solved. Either the benchmarks measure something your team doesn't care about, or the numbers can't be trusted.

Here's why public benchmarks fail. MMLU, HumanEval, GSM8K—these tests are published on arXiv, discussed on Reddit, embedded in blog posts and training data discussions. Models trained on internet-scale data have almost certainly seen these benchmarks during training. Research on benchmark data contamination shows that when test questions appear in training data—whether deliberately or accidentally—models memorize them rather than learning the underlying concept. When a vendor reports "86.8% on MMLU," you have no way to know whether the model learned the concept or memorized the test.

This creates a perverse incentive. Once benchmarks become public targets, vendors optimize for them. Privately test dozens of model versions, then publish only the best results. Cherry-pick favorable conditions. The score improves. Real capability doesn't.

"Headline scores often measure how well a model gamed the test harness rather than how well it solved the underlying tasks."
— Kili Technology, Custom AI Benchmark Guide

Analysis of the Chatbot Arena leaderboard found that companies could boost reported scores by up to 112% through selective disclosure of privately-tested variants—what researchers call "reward hacking." The same study showed that 27 private LLM variants were tested by Meta before the Llama-4 release, with only the best results published. Kili Technology's research confirmed that organizations overestimate their models' real performance by 30% or more when they optimize for published benchmarks without measuring actual work.

There's no need to speculate: frontier model vendors are in a race, burning billions of dollars to lead the AI space. The incentives are obvious. When you see their self-published benchmark scores improve, would you blindly trust them? Or would you want to verify the improvements on your own work?

The answer is simple: create a test from your actual work, run it on different models multiple times, and compare the averages. Then use your judgment about how much confidence you need before deciding. This is what OpenAI, Anthropic, and other serious teams do: they run benchmarks on their own problems, report the percentage, and move on. You should too, and you can use real problems from your codebase to do it.

What Is a Benchmark? (It's Just a Score)

A benchmark is simple: you have a set of problems, you run models on them multiple times using independent sessions, count how many the model solves each time, and average the results. That's your score.

Example:

  • You have 50 refactoring tasks from your codebase
  • You run Model A twice (two independent sessions) on all 50 problems
  • Run 1: 42 pass your tests → 84%
  • Run 2: 41 pass your tests → 82%
  • Average score: 83%

Compare across models, for example:

  • Model A: 83% (avg of 2 runs)
  • Model B: 78% (avg of 2 runs)

Model A is better. But verify with more runs if this is a significant decision (see "Handling Variance" below).
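The arithmetic above fits in a few lines. This is a minimal sketch, not a prescribed implementation; the function names `pass_rate` and `benchmark_score` are made up for illustration, and the numbers are the ones from the example.

```python
from statistics import mean


def pass_rate(passed: int, total: int) -> float:
    """Percentage of problems that passed in a single run."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * passed / total


def benchmark_score(passes_per_run: list[int], total: int) -> float:
    """Average the per-run pass rates into one score."""
    return mean(pass_rate(p, total) for p in passes_per_run)


# Model A from the example: 42 and 41 passes out of 50 problems.
model_a = benchmark_score([42, 41], total=50)
print(f"Model A: {model_a:.0f}% (avg of 2 runs)")
```

Averaging per-run percentages (rather than pooling all attempts) keeps each run visible, which makes it easy to see how far apart the runs were.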

For example, Anthropic publishes Claude Opus 4.7 at 92.8% on MMLU, and OpenAI reports GPT-5.5 at 92.4% on MMLU. They run benchmarks, report the percentage, use the score to guide decisions. Your internal benchmarks follow the exact same methodology—but applied to your actual problems, not public benchmarks. (These are examples only; your benchmark will measure real work unique to your codebase.)

The benchmark doesn't have to be fancy. It just has to be:

  1. Real problems (from your codebase, not made up)
  2. Measurable (you have a way to check pass/fail)
  3. Verifiable (you can run it again and get results consistent enough to be useful)

How to Build a Benchmark: 4 Steps

Now that you understand what you're protecting yourself against, here's how to implement it:

  1. Find real problems you've already solved
  2. Write programmatic checks
  3. Run models and score
  4. Build your benchmark

Step 1: Find Real Problems You've Already Solved

Look at recent work your team completed:

  • A refactoring you shipped last month
  • A feature you built
  • An API migration
  • A legacy code cleanup

Pick one. Get the before code and the after code (your solution).
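One way to record a problem is as a small data structure that points at the two commits. The sketch below assumes your code lives in Git; the `BenchmarkProblem` class, its field names, and the sample commit IDs and path are hypothetical. The only real command used is `git show <ref>:<path>`, which prints a file as it existed at a given commit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkProblem:
    """One solved task, recorded so it can be replayed against a model."""
    problem_id: str
    description: str         # the problem statement given to the model
    before_ref: str          # commit holding the "before" code
    after_ref: str           # commit holding your shipped solution
    paths: tuple[str, ...]   # files the task touched


def git_show_command(ref: str, path: str) -> list[str]:
    """Build the `git show <ref>:<path>` command for one file at one commit."""
    return ["git", "show", f"{ref}:{path}"]


# Hypothetical example: a payment refactoring between two commits.
problem = BenchmarkProblem(
    problem_id="payments-fees",
    description="Refactor to support new fee structures",
    before_ref="a1b2c3d",
    after_ref="e4f5a6b",
    paths=("src/payments/fees.py",),
)
print(git_show_command(problem.before_ref, problem.paths[0]))
```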

Step 2: Write Programmatic Checks

For the "before" code, define what "correct" means. Examples:

API design task: You asked the model to add a new endpoint to an API.

  • Check: Does the schema include all required fields? (programmatic validation)
  • Check: Does it follow your naming conventions? (regex or linter)
  • Check: Do the resolvers compile and return the right type? (type checking or runtime test)
  • Pass if: All checks pass. Fail if: Any check fails.

Refactoring task: You asked the model to refactor a payment module.

  • Check: Does the refactored code pass the existing test suite? (run tests)
  • Check: Does it pass the linter? (run linter)
  • Check: Are the function signatures compatible with call sites? (check imports/exports)
  • Pass if: All checks pass (tests, linter, and signature compatibility). Fail if: Any check fails.

Feature implementation: You asked the model to implement a feature.

  • Check: Does it pass your test suite? (run tests)
  • Check: Does it match the spec you provided? (automated checks where possible; otherwise a fixed yes/no checklist)
  • Pass if: Tests pass + spec matches. Fail if: Tests fail or spec doesn't match.

The key: make the check programmatic. Tests, linters, type checkers, schema validators. No subjective judgment. Binary: pass or fail.
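A binary check can be as simple as "every command exits with status 0". The sketch below assumes your test suite and linter are command-line tools; `run_checks` is a hypothetical helper, and the two commands in the demo are stand-ins that simply exit with 0 or 1 so the example runs anywhere. In practice you would pass your own test and lint commands.

```python
import subprocess
import sys
from typing import Optional


def run_checks(commands: list[list[str]], cwd: Optional[str] = None,
               timeout: int = 600) -> bool:
    """Return True only if every command exits with status 0."""
    for cmd in commands:
        try:
            result = subprocess.run(cmd, cwd=cwd, capture_output=True,
                                    timeout=timeout)
        except (subprocess.TimeoutExpired, FileNotFoundError):
            # A hung or missing tool counts as a failed check.
            return False
        if result.returncode != 0:
            return False
    return True


# Stand-in commands (assumptions): replace with your test runner and linter.
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]

print(run_checks([passing, passing]))  # every check passed
print(run_checks([passing, failing]))  # one check failed
```

Treating timeouts and missing tools as failures keeps the result strictly pass/fail, with no third "unknown" state to interpret.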

Step 3: Run Models and Score

Give both the "before" code and the problem description to your models. Run each problem at least twice using independent sessions (different agent sessions, not reused). Here's what this looks like:

Model A:

  • Run 1: 42/50 problems pass → 84%
  • Run 2: 41/50 problems pass → 82%
  • Average: 83%

Model B:

  • Run 1: 35/50 problems pass → 70%
  • Run 2: 36/50 problems pass → 72%
  • Average: 71%

Model A is better. For prototype decisions with clear differences like this (83% vs 71%), two independent runs are your minimum. For significant decisions (vendor selection, major spending), see "Handling Variance" below for guidance on when to run more times.

Step 4: Build Your Benchmark

Once you have one working test, add more:

  • Extract another 5–10 real problems your team has solved, and keep adding over time
  • Write programmatic checks for each
  • Create a script that runs all models on all problems and scores them
  • Run it monthly, quarterly, or whenever you need to compare models

Sample size: 50+ problems is solid. 20 is workable to start. 300+ is excellent if you have the time.
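The script that ties this together can stay small. The sketch below is one possible shape, not a definitive harness: a "solver" is assumed to be a function that takes a problem ID, gets a model's answer in a fresh session, runs your checks, and returns pass or fail. The two solvers in the demo are fakes with fixed behavior so the example runs without any API access.

```python
from statistics import mean
from typing import Callable

# Hypothetical interface: problem ID in, True if the model's output passed
# every check. A real solver would open a new model session on each call.
Solver = Callable[[str], bool]


def run_benchmark(solvers: dict[str, Solver], problem_ids: list[str],
                  runs: int = 2) -> dict[str, float]:
    """Run every model on every problem `runs` times; return average scores."""
    scores = {}
    for name, solve in solvers.items():
        run_rates = []
        for _ in range(runs):
            passed = sum(1 for pid in problem_ids if solve(pid))
            run_rates.append(100.0 * passed / len(problem_ids))
        scores[name] = mean(run_rates)
    return scores


# Fake solvers for demonstration only.
problems = [f"p{i}" for i in range(10)]
fake_solvers = {
    "model-a": lambda pid: pid != "p0",              # fails one problem
    "model-b": lambda pid: int(pid[1:]) % 2 == 0,    # passes every other one
}
scores = run_benchmark(fake_solvers, problems, runs=2)
for name, score in scores.items():
    print(f"{name}: {score:.0f}%")
```

Every model gets the same problems and the same number of runs, so the resulting scores are directly comparable.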

What Counts as a "Pass"?

Whatever you define. Examples:

Code generation: Passes your test suite. Code compiles. Follows style guide.

Refactoring: Existing tests pass. New functionality works. No breaking changes.

API design: Schema validates. Resolvers compile. Matches spec.

Feature implementation: Unit tests pass. Integration tests pass. Matches requirements.

Database migration: Runs without errors. Data integrity checks pass. Performance is acceptable.

Don't overthink it. Make it automated. Binary: works or doesn't work.

Real Examples

Refactoring Benchmark

You refactored a legacy payment module. Test it:

  1. Extract the before/after code from your repo
  2. Give the model the before code + problem statement ("Refactor to support new fee structures")
  3. Check: Does the refactored code pass existing tests? (test suite)
  4. Check: Does it pass linting? (linter)
  5. Check: Does it handle the new fee structure? (unit test)
  • Pass: All checks pass
  • Fail: Any check fails

API Design Benchmark

You migrated a REST API to GraphQL. Test it:

  1. Give the model the old REST schema + migration spec
  2. Check: Does the GraphQL schema validate? (GraphQL validator)
  3. Check: Are all required fields present? (schema inspection)
  4. Check: Do the resolvers return the right types? (type checking)
  • Pass: Schema valid + all fields present + types correct
  • Fail: Schema invalid or missing fields or type errors
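The field and type checks above can be sketched as plain functions. This example assumes the model's schema has already been parsed into a dictionary of field name to type name (that parsing step, and real GraphQL validation, would come from a GraphQL library and is not shown). The camelCase regex, the function names, and the sample fields are all illustrative assumptions.

```python
import re

# Assumed naming convention: camelCase field names.
CAMEL_CASE = re.compile(r"[a-z][a-zA-Z0-9]*")


def check_required_fields(fields: dict[str, str], required: set[str]) -> bool:
    """Every required field name appears in the schema."""
    return required.issubset(fields)


def check_naming(fields: dict[str, str]) -> bool:
    """Every field name follows the naming convention."""
    return all(CAMEL_CASE.fullmatch(name) for name in fields)


def check_types(fields: dict[str, str], expected: dict[str, str]) -> bool:
    """Every expected field has exactly the expected type."""
    return all(fields.get(name) == type_name
               for name, type_name in expected.items())


def schema_passes(fields: dict[str, str], expected: dict[str, str]) -> bool:
    """Pass only if all three checks pass."""
    return (check_required_fields(fields, set(expected))
            and check_naming(fields)
            and check_types(fields, expected))


# Hypothetical expected fields for an Order type.
expected = {"orderId": "ID!", "totalCents": "Int!"}
print(schema_passes({"orderId": "ID!", "totalCents": "Int!"}, expected))
print(schema_passes({"order_id": "ID!", "totalCents": "Int!"}, expected))
```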

Feature Implementation Benchmark

You built a feature. Test it:

  1. Give the model the requirements + existing codebase
  2. Check: Does the code pass your test suite? (test runner)
  3. Check: Does it compile? (compiler or type checker)
  4. Check: Does it follow conventions? (linter for style)
  • Pass: Tests pass + compiles + style clean
  • Fail: Tests fail or doesn't compile or style violations

That's the Core Benchmark

You now have an objective way to score models. Compare them. Share the score with your team. Switch models based on data, not gut feel.

This is 95% of what you need. OpenAI and Anthropic do exactly this—report a percentage and move on.

Handling Variance: When to Run Multiple Times

Models can produce slightly different outputs between runs. You must run multiple times to get reliable results.

The baseline: Run twice, minimum. Use independent sessions for each run (different agent sessions, not the same one repeated). If you reuse the same session, you introduce state bias from token history and model memory. Results from reused sessions are invalid and should be discarded. Average the two scores.

Check your variance:

  • Is the variance small? (same score both times, or within 1–2 percentage points)
  • Is the difference between models large? (e.g., 85% vs 70%)

If yes to both AND you're making a low-stakes decision (prototype, internal tooling), two runs might be sufficient. If no—if variance is large, models are close, or this is a significant decision (vendor selection, major spending)—run more trials.
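The decision rule above can be written down so that everyone on the team applies it the same way. This is a sketch under stated assumptions: the 2-point spread limit follows the "within 1–2%" guidance, but the 10-point minimum gap is an illustrative threshold chosen here, not a number from this article. Tune both to your own tolerance.

```python
from statistics import mean


def two_runs_sufficient(model_a_runs: list[float], model_b_runs: list[float],
                        low_stakes: bool, max_spread: float = 2.0,
                        min_gap: float = 10.0) -> bool:
    """True if variance is small, the gap is large, and stakes are low.

    max_spread and min_gap are in percentage points; min_gap's default
    is an assumption made for this example.
    """
    spread_a = max(model_a_runs) - min(model_a_runs)
    spread_b = max(model_b_runs) - min(model_b_runs)
    gap = abs(mean(model_a_runs) - mean(model_b_runs))
    small_variance = spread_a <= max_spread and spread_b <= max_spread
    return low_stakes and small_variance and gap >= min_gap


# Scores from the earlier example: Model A 84/82, Model B 70/72.
print(two_runs_sufficient([84, 82], [70, 72], low_stakes=True))
print(two_runs_sufficient([84, 82], [80, 81], low_stakes=True))
```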

Anthropic recommends running evaluations multiple times; in their Claude 3 evaluations, they averaged results over 5 trials per problem. HumanEval, the standard code generation benchmark, handles stochastic sampling with pass@k, typically reported at k = 1, 10, and 100: the probability that at least one of k sampled solutions passes the unit tests.
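For readers curious how a pass@k number is computed, here is a sketch of the unbiased estimator described in the paper that introduced HumanEval: generate n samples for a problem, count the c that pass, and estimate the chance that at least one of k randomly chosen samples would pass. The function name and the sample numbers are illustrative; you only need this if you sample many solutions per problem.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c passed.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    k samples drawn without replacement are all failures.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers: 10 samples for one problem, 3 of them passed.
print(f"pass@1: {pass_at_k(10, 3, 1):.2f}")
print(f"pass@5: {pass_at_k(10, 3, 5):.2f}")
```

The per-problem estimates are then averaged across the benchmark to get the reported figure.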

How many runs beyond two? That depends on your risk tolerance and budget. The tradeoff is real: more runs = higher confidence, but also more tokens and more time. Context matters:

  • Prototype/internal: Two runs (with independent sessions) is your minimum
  • Department-level decision: 5 runs and compare variance
  • $50K+/month vendor decision: 10+ runs, report confidence intervals

Two non-negotiable rules:

  1. Independent sessions only. Reusing the same session introduces state bias. If you violate this, discard those results.
  2. Same number of trials for each model. Comparing Model A with 10 runs against Model B with 3 runs is invalid and misleading.

Should You Only Test New Models?

No. Once you have your benchmark, keep it. Models change over time—sometimes they improve, sometimes they degrade. And vendors don't always announce changes.

A real example: In early 2026, Claude Opus 4.6 changed without notice. Analysis of 6,852 real usage sessions detected median reasoning depth dropped 73% between January and March. Anthropic published a postmortem explaining three infrastructure bugs: context window routing errors (16% of requests), TPU misconfigurations, and compiler bugs in token probability calculations. Teams discovered the change because their work started failing differently—not because they were notified.

If your team uses a model in production, run your benchmark every quarter (or whenever workflows feel different). If your score changes significantly, you know something shifted. You have data instead of guesswork.
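Spotting a shift is a comparison against your own history. The sketch below assumes you keep past benchmark scores for the same model in a list; the function name and the 5-point threshold are illustrative assumptions, so pick a threshold that is larger than the run-to-run variance you normally see.

```python
from statistics import mean
from typing import Optional


def detect_shift(history: list[float], latest: float,
                 threshold: float = 5.0) -> Optional[float]:
    """Return the change in percentage points if it is at least `threshold`.

    The baseline is the average of past scores. Returns None when the
    latest score is within the threshold of that baseline.
    """
    if not history:
        raise ValueError("need at least one past score")
    delta = latest - mean(history)
    return delta if abs(delta) >= threshold else None


# Hypothetical quarterly scores for one model, then a new run.
past_scores = [83.0, 84.0, 82.0]
print(detect_shift(past_scores, 82.0))  # within normal range
print(detect_shift(past_scores, 71.0))  # a drop worth investigating
```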

This is where benchmarks save time: one caught change prevents weeks of "why is this different now?" investigation.

Advanced: If You Want to Go Deeper

There are optional refinements if you want more statistical rigor. Most teams don't need these, but they exist.

If You Want Statistical Rigor (For Published Results)

The baseline two-run approach works for most internal decisions ("which model should we use?"). But if you're publishing your results or want confidence intervals around your score, Anthropic recommends 5–10 runs per problem. Average the results and report the standard error of the mean (SEM) alongside your score.

  • When to use it: You're publishing benchmark results and want to claim statistical significance.
  • Cost: 2.5–5× the API calls of the two-run baseline.
  • Skip this if: You're comparing models internally and your decision has low stakes.
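The standard error of the mean is the sample standard deviation of your run scores divided by the square root of the number of runs. A minimal sketch using only the standard library follows; the function name and the five scores are made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_sem(run_scores: list[float]) -> tuple[float, float]:
    """Return (mean, standard error of the mean) for a list of run scores."""
    if len(run_scores) < 2:
        raise ValueError("need at least two runs to estimate the SEM")
    return mean(run_scores), stdev(run_scores) / sqrt(len(run_scores))


# Illustrative scores from five independent runs of one model.
scores = [84.0, 82.0, 86.0, 83.0, 85.0]
avg, sem = mean_and_sem(scores)
print(f"{avg:.1f}% ± {sem:.1f} (SEM, n={len(scores)})")
```

With only a handful of runs the SEM is itself a rough estimate, so treat two models whose scores sit within a couple of SEMs of each other as not clearly different.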

If You're Measuring the Impact of Major Refactors

Say you're debating TypeScript strict mode. You could run your benchmark before and after, then see: "With strict mode: 85%. Without strict mode: 78%." This tells you the impact of the refactor on AI performance.

  • When to use it: You're investing in a major technology change and want to quantify the AI payoff.
  • Cost: Setup effort (before/after benchmarks, controlled variables).
  • Skip this if: You're just trying to pick a model; this is for tech strategy decisions.

If You're Relying on Published Research

You read that "TypeScript improves code quality 15%." Before implementing TypeScript company-wide, verify it applies to your scenario. They might have tested "strict TypeScript + ESLint + Prettier," while you're considering "loose TypeScript only." Their 15% might not apply. Run your own benchmark before committing.

  • When to use it: You're making a large organizational change and external research drove the decision.
  • Cost: Build a before/after benchmark.
  • Skip this if: You're just picking a model or comparing minor changes.

Keep Your Benchmark Private

One last thing: don't publish the test cases themselves. Publish the score (85% vs 72%), publish your methodology (how you tested), but keep the actual problems private.

Why? Because once the test cases are public, models start training on them. As research shows, public benchmarks become contaminated quickly.

"High scores on static benchmarks can be deeply misleading, and inflated leaderboard scores can be uncorrelated with the underlying capabilities."
— Kili Technology, Custom AI Benchmark Guide

Your private benchmark is your competitive advantage—it measures real capability without the noise.

Publishing your results is great. Sharing insights is great. Publishing the test cases is how you kill your benchmark.

Why This Matters

You're spending tens of thousands of dollars monthly on AI. Models change without notice—sometimes they improve, sometimes they degrade. Vendors don't always announce what shifted. Your team wastes time on a slower or lower-quality model because nobody measured.

A simple benchmark—50 real problems, scored on pass/fail—changes that. It takes a few hours to set up. It pays for itself the first month you catch a change or switch to a cheaper model that actually works.

Do it. Your bottom line will thank you.

Frequently asked questions

How do I know if a model is good enough for my team?
Test it on your actual problems. If you can solve a task (refactoring, API design, test writing), you can turn it into a benchmark. Run the model on the same problem and see whether its output passes the same checks your solution does.
Do I need to understand statistics to build a benchmark?
No. You just need a way to check if the model got it right. Run 50 problems twice with independent sessions, count the passes each time, average the results. That's your score. If Model A averages 85% and Model B averages 72%, Model A is better. For close scores, run more times to verify.
What counts as a pass?
Whatever you define. Code that passes your test suite? Yes, pass. Schema that includes all required fields? Yes, pass. Implementation that matches your style guide? Yes, pass. You decide the criteria—make it programmatic and repeatable.
Why do I get different scores when I run the benchmark again?
Models can produce slightly different outputs between runs. This is normal. Always use independent sessions (separate agent sessions, not reused)—reusing the same session introduces state bias and invalidates results. Run the benchmark twice minimum using independent sessions, then average. If variance is small and models show a clear winner, you have your answer. If variance is large or models are close, run 5–10 trials and average. How much rigor you need depends on what decision you're making and what the cost of being wrong would be.
How many times should I run the benchmark?
Minimum: twice, using independent sessions, then average. For low-stakes decisions (prototype, internal testing), two runs might be enough if variance is small. For department-level decisions, run 5 trials and compare variance; for $50K+/month vendor selection, run 10 or more and report confidence intervals. Anthropic likewise averages over multiple trials in their evaluations. Non-negotiable rules: (1) Use independent sessions, because reusing the same session biases results; (2) Run the same number of trials for each model so comparisons are fair. More runs = higher confidence, but also more tokens and time. Choose based on your stakes and budget.
Should I publish my benchmark?
Publish your results and methodology (the score, how you tested). Keep the actual test cases private—they're your competitive advantage. Share the insights, not the test data.
How many test cases do I need?
Start with 10–20. Get to 50+ if you want a reliable score. More tests = more confidence in the result. If you have 300 tests, even better.