Apr 8, 2026

Evaluation Dataset Design for Frontier Models

In the previous posts, we covered how LLMs are trained. This post covers how we evaluate them.

Benchmarks are the primary way AI progress is measured. When an AI developer announces a new model, they cite benchmark scores. VCs evaluating AI startups consider the same scores, as do governments deliberating AI policy.

Constructing a benchmark that produces a meaningful number is challenging. The benchmark needs to be designed so that the score it produces corresponds to the capability it claims to measure. This sounds obvious, but it’s where many benchmarks fail.

This post covers principles of benchmark design for capability benchmarks (as opposed to alignment benchmarks) through three examples: Humanity’s Last Exam (HLE), OSWorld-Verified, and IDE-Bench.

Each benchmark is very different. HLE is a benchmark that contains 2,500 questions measuring advanced and often obscure scientific knowledge. The 369 tasks in OSWorld-Verified involve an agent operating a real Linux computer, ranging from editing spreadsheets to navigating websites. IDE-Bench evaluates AI software engineering agents on 80 tasks in private codebases.

Each benchmark sits in a different part of the design space, and its developers made different trade-offs. By looking at what they actually did rather than what benchmark design should look like in the abstract, we can see five principles that show up across all three:

The benchmark should resist being leaked into model training data.
The way solutions are scored determines the scope of the benchmark.
The difficulty of tasks has to be calibrated against the frontier models.
Reporting has moved from a single score per model toward richer, multi-metric results.
Maintenance of a benchmark is important.

Each of these comes with a tradeoff, and the choices a benchmark team makes early shape what the benchmark can and cannot measure later.

Contamination resistance comes first

A benchmark that has leaked into the training data of models measures how well they memorized it, not their capability. We discussed this in an earlier post on filtering and decontamination: even careful model developers find that benchmark questions sneak into pre-training corpora, and it is difficult to remove them. From the benchmark designer’s side, this means contamination resistance can’t be left to the model developers. It has to be built into the benchmark from the start.

The three benchmarks took different routes, and the route each team chose constrained almost everything else about the benchmark.

IDE-Bench, which evaluates the performance of model coding abilities in complex codebases, took the strictest path. Seven of its eight repositories forming evaluation environments were built from scratch by the team and never published elsewhere. Models could not have seen this code during training because it did not exist on the internet. The cost is that the benchmark is small, with only 80 tasks. Further, their language coverage was what the team had the bandwidth to build. Languages like Go and Rust, and large categories like mobile development, are absent. A private-codebase strategy provides full contamination protection but caps your scope at what engineers can build.

HLE chose a different route: keep the questions public but make them adversarial. A question only entered HLE if frontier models at that time failed on it. This filter performs two tasks simultaneously. It ensures difficulty, which will be discussed later, and it filters out anything models already know. This helps with contamination resistance. HLE also keeps a private held-out set, so if public questions are leaked into the training corpus of a model, the developers would see models doing much better on these public questions compared to their private ones. Additionally, the authors ran a post-hoc audit of questions using models with web search to remove any questions found through retrieval.
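The public-versus-private comparison can be sketched as a simple two-proportion test. This is a minimal illustration, not HLE’s actual tooling: the function name, the counts, and the choice of a z-test are all assumptions made for the example.

```python
from math import sqrt

def contamination_gap(public_correct, public_total, private_correct, private_total):
    """Return (accuracy gap, z-score) for public vs. private questions.

    A large positive gap with a large z-score hints that the public
    questions may have leaked into training data. (Hypothetical helper.)
    """
    p_public = public_correct / public_total
    p_private = private_correct / private_total
    # Pooled accuracy under the null hypothesis that both splits are equally hard.
    pooled = (public_correct + private_correct) / (public_total + private_total)
    se = sqrt(pooled * (1 - pooled) * (1 / public_total + 1 / private_total))
    z = (p_public - p_private) / se if se > 0 else 0.0
    return p_public - p_private, z

# Made-up counts: 25% accuracy on public questions, 10% on the private set.
gap, z = contamination_gap(500, 2000, 50, 500)
print(f"gap={gap:.2f}, z={z:.1f}")
```

The test only flags a suspicious gap. It assumes the two splits are drawn from the same distribution, so a gap could also come from the private set simply being harder.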

OSWorld-Verified takes a third route. A task on this benchmark might be “open this spreadsheet, sort the data by column B, and export it as CSV” or “find the cheapest flight from New York to Tokyo on this booking website and add it to the cart.” The agent sees screenshots and controls the mouse and keyboard, just as a human user would.

This setup obtains contamination resistance at almost no cost. The agent isn’t answering questions from a fixed dataset; it’s interacting with live software. A model can’t memorize the current state of a booking website, because the website keeps changing, and it can’t memorize the right sequence of clicks for a spreadsheet task, because the spreadsheet is generated fresh each time. Even if a task description leaked into training data, the model would still have to interact with the computer to complete the task.

The cost shows up elsewhere. The same volatility that prevents memorization also breaks the evaluation in ways we’ll cover in the maintenance section.

The way solutions are scored shapes what the benchmark can measure

Once you have tasks, you need a way to grade the model’s solutions. This sounds like an implementation detail, but it is not. The grading mechanism determines the kinds of tasks the benchmark can include. A benchmark can only test capabilities whose answers the developers can verify reliably and at scale.

There’s a hierarchy of grading mechanisms, each with its own ceiling. At the simplest end is exact match: the model produces an answer, and you compare it to a known correct answer. This is cheap and unambiguous, but it only works when the answer can be expressed as a short string. The next step up is test suite execution. The model produces code, and you run a set of tests against it. It works for software tasks, but requires correctness to be expressed as tests. The most expressive choice is a custom evaluator that checks specific properties of the final state the model leaves behind for each task. This is flexible but expensive to build and fragile to maintain. There is also the option of using an LLM to grade the answers, but this method is far less reliable than the others.

HLE chose exact match. Every question in HLE has a known, unambiguous answer that can be checked automatically. A quarter of the questions are multiple choice, and the rest are short-answer, exact-match questions. This choice enabled HLE to scale up to 2,500 questions across dozens of subjects without needing human graders for every model evaluation. The cost is that HLE can only ask questions where a single right answer exists. A proof with the correct conclusion but flawed intermediate steps gets full credit. A long-form explanation that gets the gist right but uses different phrasing gets no credit. HLE measures whether the model knows the answer, not whether its reasoning is sound.
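To make the trade-off concrete, here is a minimal sketch of an exact-match grader. The normalization rules (lowercasing and whitespace collapsing) are assumptions chosen for illustration and are not taken from HLE’s grading code.

```python
import re

def normalize(answer: str) -> str:
    """Lowercase, trim, and collapse internal whitespace (assumed rules)."""
    return re.sub(r"\s+", " ", answer.strip().lower())

def exact_match(prediction: str, reference: str) -> bool:
    """True only if the normalized strings are identical."""
    return normalize(prediction) == normalize(reference)

def accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference."""
    graded = [exact_match(p, r) for p, r in zip(predictions, references)]
    return sum(graded) / len(graded)

print(exact_match("  Paris ", "paris"))               # True
print(exact_match("The capital is Paris", "Paris"))   # False
```

The second call shows the limitation described above: an answer that contains the right fact in a longer sentence gets no credit, because the grader only compares strings.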

IDE-Bench chose test suite execution combined with comparison against a golden patch. The model edits the code, and the harness runs the repository’s test suite. A successful task is one where the tests pass. This works because every task in IDE-Bench was written with tests in mind from the start. The benchmark cannot include tasks where correctness can’t be expressed as tests, like pure refactoring with no behavioral change, or tasks where multiple very different approaches would all be acceptable.

OSWorld-Verified chose the most expressive option. Each of its tasks has a custom evaluator that checks specific properties of the final state. Examples of checks include the following: Did the cookie get deleted? Does the spreadsheet contain the right values in the right cells? Is the file saved in the right format? This expressiveness enables OSWorld to cover the breadth of computer-use tasks. However, it comes at a cost we’ll see clearly in the maintenance section: the OSWorld team needed ten people and two months to fix 300+ evaluator issues that the community had reported.
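A final-state check of this kind might look like the sketch below, which grades the earlier “sort by column B and export as CSV” example. The function name, the file names, and the specific checks are hypothetical and are not taken from the OSWorld codebase.

```python
import csv
import tempfile
from pathlib import Path

def check_sorted_csv_export(path: str, sort_column: str) -> bool:
    """Pass only if the file exists, is a CSV, and is sorted by sort_column."""
    file = Path(path)
    if not file.is_file() or file.suffix.lower() != ".csv":
        return False
    with file.open(newline="") as handle:
        rows = list(csv.DictReader(handle))
    if not rows or sort_column not in rows[0]:
        return False
    try:
        values = [float(row[sort_column]) for row in rows]
    except (TypeError, ValueError):
        return False  # Non-numeric or missing cells fail the check.
    return values == sorted(values)

with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "export.csv"
    good.write_text("A,B\nx,1\ny,2\nz,3\n")
    bad = Path(tmp) / "unsorted.csv"
    bad.write_text("A,B\nx,3\ny,1\nz,2\n")
    good_result = check_sorted_csv_export(str(good), "B")
    bad_result = check_sorted_csv_export(str(bad), "B")
    missing_result = check_sorted_csv_export(str(Path(tmp) / "missing.csv"), "B")

print(good_result, bad_result, missing_result)
```

Every assumption baked into the evaluator (file extension, column name, numeric values, ascending order) is a place where a valid solution could be marked wrong, which is where the fragility comes from.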

The pattern across the three benchmarks is the same. More expressive grading lets a benchmark cover more realistic tasks, but it also makes the benchmark more fragile.

Difficulty has to be calibrated against the frontier

A benchmark that frontier models solve at 95% is no longer useful, because the remaining 5% is mostly noise. The gap between two strong models that score 90+% tells you more about which questions are mislabeled than about which model is better. MMLU is the cautionary example. When it was released in 2020, the best model scored 37%. By 2024, frontier models were above 90%, and the benchmark stopped distinguishing between models that were obviously different in capability.

A benchmark designer trying to avoid this fate has to choose how to calibrate difficulty against current frontier models. The three benchmarks made very different choices.

HLE has the most aggressive calibration mechanism. The submission process itself is an adversarial filter: a question is only accepted if frontier models fail on it. Specifically, each submitted question was tested against several models. If the models answered correctly or, for multiple-choice questions, did better than random guessing, the question was rejected. Questions that survived then went through expert peer review. The result is that, by construction, every question in HLE was at the frontier of model capability at the moment of inclusion.
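The acceptance rule can be sketched in a few lines. The exact criteria HLE used are not spelled out here, so this sketch assumes one reading: each model is sampled several times, a short-answer question is rejected if any sample is correct, and a multiple-choice question is rejected if any model’s hit rate exceeds chance. The function name and inputs are hypothetical.

```python
def accepts_question(samples_by_model, reference, num_choices=None):
    """Return True if every model fails the question under the assumed rule.

    samples_by_model maps a model name to a list of sampled answers.
    num_choices is set only for multiple-choice questions.
    """
    # Chance level is 1/num_choices for multiple choice and zero otherwise.
    chance = 1 / num_choices if num_choices else 0.0
    for samples in samples_by_model.values():
        hit_rate = sum(answer == reference for answer in samples) / len(samples)
        if hit_rate > chance:
            return False  # At least one model does better than allowed.
    return True

# Short answer: no model ever produces "42", so the question is accepted.
print(accepts_question({"model-a": ["7", "9"], "model-b": ["12"]}, "42"))
# Multiple choice with 5 options: 2 of 5 samples correct beats chance, so it is rejected.
print(accepts_question({"model-a": ["A", "A", "C", "D", "E"]}, "A", num_choices=5))
```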

This is a strong design choice that has a subtle failure mode. The adversarial filter selects questions that current models fail on, but doesn’t distinguish between questions testing significant versus obscure information. An independent investigation suggested around 30% of HLE answers in text-only chemistry and biology questions might be wrong. Some of this is normal benchmark error, but it is also partially a consequence of the adversarial filter selecting for questions near the edge of what experts themselves agree on. When you optimize for “models fail on this,” you end up with questions where the right answer is contested.

IDE-Bench takes a softer approach. The developers selected real engineering tasks they considered representative of professional software work, then ran frontier models against them during curation. They didn’t filter out tasks that models could solve, but instead ensured the whole set had headroom. There are tasks that no top model consistently solves, and the team flagged these as concrete targets for future progress. The result is a benchmark where difficulty is real but not artificially constructed. The downside is that some tasks are easy for frontier models, and the average score among top models is high enough that the benchmark may saturate within a year or two.

OSWorld-Verified inherits its difficulty from the world. Computer use is challenging for current models for many reasons: GUI interaction, multi-step planning, and recovering from errors are all weaknesses. The team didn’t engineer difficulty; they took real tasks from real applications and let the natural complexity of operating a computer set the bar. The current best system, Claude Mythos, achieves 79% on OSWorld-Verified, which is on par with human performance.

The three approaches map a tradeoff. HLE’s adversarial filter guarantees headroom, but at the cost of selecting for the obscure. IDE-Bench’s representativeness preserves the meaning of tasks but risks saturation. OSWorld’s reliance on real-world complexity gives organic difficulty, but doesn’t let the team control where it lives.

A single score is not enough

The traditional method of reporting benchmark results is one number per model. Model A scores 87%, Model B scores 82%, so Model A is better. It’s easy to interpret, and because complex metrics can be confusing, it is how most benchmark papers in the early years of LLMs reported results.

The problem is that a single number hides too much. Two models with the same accuracy can fail in completely different ways, take drastically different amounts of compute, or produce results with very different reliability. For deployment, these differences matter as much as the headline number. The three benchmarks illustrate how the field is moving toward more comprehensive reporting.

HLE reports calibration error alongside accuracy. Calibration measures whether a model’s stated confidence matches its actual accuracy. A well-calibrated model that says it is 80% confident should be right 80% of the time. HLE’s authors found that frontier models on their benchmark at the time of publication had calibration errors above 80%, meaning the models were extremely confident even when they were wrong. Recent models have improved in this regard. Poor calibration is a different kind of failure from low accuracy: a model that knows what it doesn’t know is much more useful in practice than one that confidently invents answers. Reporting calibration alongside accuracy makes this distinction visible.
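One common way to turn this idea into a number is a binned RMS calibration error, sketched below. The equal-width bins and the bin count are assumptions for illustration, and HLE’s exact computation may differ.

```python
from math import sqrt

def rms_calibration_error(confidences, correct, num_bins=10):
    """Binned RMS gap between stated confidence and observed accuracy.

    confidences are floats in [0, 1]; correct is a parallel list of booleans.
    """
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, ok))
    total = len(confidences)
    squared = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        # Weight each bin's squared gap by the share of answers it holds.
        squared += (len(bucket) / total) * (mean_conf - accuracy) ** 2
    return sqrt(squared)

# An overconfident model: 90% stated confidence, but only 1 of 10 answers correct.
overconfident = rms_calibration_error([0.9] * 10, [True] + [False] * 9)
# A well-calibrated model: 80% stated confidence and 8 of 10 answers correct.
calibrated = rms_calibration_error([0.8] * 10, [True] * 8 + [False] * 2)
print(round(overconfident, 2), round(calibrated, 2))
```

In the toy data, the overconfident model has a calibration error of about 0.8 even though its accuracy (10%) is only one part of the story; the second model has an error of about zero.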

IDE-Bench goes further. The benchmark reports “pass@1” (whether the model succeeded on the first attempt) and “pass@5” (whether it succeeded at least once in 5 attempts), but it also breaks down model behavior into a failure taxonomy. The authors categorize each unsuccessful run into one of several failure modes. These include premature editing (the model starts modifying code before understanding the codebase), thrashing (the model keeps backtracking), context loss (the model loses track of what it was doing), tool call failures, syntax error loops, and timeouts.

This taxonomy reveals something a single score would hide: open-weight models and frontier models fail in qualitatively different ways. For open-weight models, premature editing accounts for 80-95% of failures, which suggests that they don’t spend enough time understanding the codebase before acting. Frontier models fail through different means: they might constantly switch between approaches or forget things from their context, suggesting they understand the problem but can’t reliably execute on it. The same overall pass@5 numbers can mean very different things about deployment readiness.
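Once each failed run has a label, the share of each failure mode is a simple tally. The sketch below uses made-up labels for ten failed runs of one hypothetical model; the label strings are illustrative and not IDE-Bench’s identifiers.

```python
from collections import Counter

def failure_mode_shares(failed_runs):
    """Return each failure mode's share of all failed runs."""
    counts = Counter(failed_runs)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {mode: count / total for mode, count in counts.items()}

# Hypothetical labels for ten failed runs of one model.
labels = ["premature_editing"] * 8 + ["timeout", "thrashing"]
shares = failure_mode_shares(labels)
print(shares)
```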

IDE-Bench also reports token efficiency (pass@5 divided by tokens per success), which differentiates models that solve tasks efficiently from models that solve them by consuming excessive compute.
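Both metrics can be computed from per-attempt logs. In the sketch below, pass@k is the fraction of tasks solved at least once in the first k attempts, matching the description above. “Tokens per success” is read as total tokens spent across all attempts divided by the number of successful attempts; that reading, the function names, and the data are assumptions, and IDE-Bench’s exact definition may differ.

```python
def pass_at_k(attempts_by_task, k):
    """Fraction of tasks with at least one success in the first k attempts."""
    solved = sum(
        any(passed for passed, _ in attempts[:k])
        for attempts in attempts_by_task.values()
    )
    return solved / len(attempts_by_task)

def token_efficiency(attempts_by_task, k=5):
    """pass@k divided by tokens per success (assumed: total tokens / successes)."""
    all_attempts = [a for attempts in attempts_by_task.values() for a in attempts[:k]]
    successes = sum(passed for passed, _ in all_attempts)
    if successes == 0:
        return 0.0
    tokens_per_success = sum(tokens for _, tokens in all_attempts) / successes
    return pass_at_k(attempts_by_task, k) / tokens_per_success

# Made-up logs: each attempt is (passed, tokens_used).
runs = {
    "task-a": [(True, 2000), (True, 2000), (False, 4000), (True, 2000), (True, 2000)],
    "task-b": [(False, 4000), (False, 4000), (True, 2000), (False, 4000), (False, 4000)],
    "task-c": [(False, 4000)] * 5,
}
print(pass_at_k(runs, 1), pass_at_k(runs, 5), token_efficiency(runs))
```

In this toy data, pass@1 is 1/3 and pass@5 is 2/3, while the 50,000 tokens spent for 5 successes give 10,000 tokens per success. Two models with the same pass@5 can therefore differ widely on the efficiency metric.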

OSWorld-Verified focuses on performance tiers and gap-to-human alongside the leaderboard score. The team emphasized in their blog post that the headline number is less informative than the variation in performance across task categories. Some categories show dramatic improvement with newer models; others remain stuck. A team selecting a model for a specific deployment should prioritize the categories that align with its use case, rather than the average.
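A per-category breakdown is straightforward to compute from task-level results. The category names and outcomes below are made up for illustration and are not OSWorld-Verified data.

```python
from collections import defaultdict

def success_by_category(results):
    """Map each category to its success rate, given (category, passed) pairs."""
    totals = defaultdict(lambda: [0, 0])  # category -> [successes, attempts]
    for category, passed in results:
        totals[category][0] += int(passed)
        totals[category][1] += 1
    return {category: s / n for category, (s, n) in totals.items()}

# Hypothetical task outcomes for one agent.
results = [
    ("spreadsheet", True), ("spreadsheet", True),
    ("spreadsheet", True), ("spreadsheet", False),
    ("browser", True), ("browser", False),
    ("browser", False), ("browser", False),
]
by_category = success_by_category(results)
overall = sum(passed for _, passed in results) / len(results)
print(overall, by_category)
```

The overall score here is 50%, but it averages a 75% spreadsheet success rate with a 25% browser success rate, which is exactly the kind of spread the headline number hides.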

The pattern across all three is that single-number reporting creates perverse optimization pressure. If only pass@5 matters, labs will optimize for pass@5 even when it degrades reliability, calibration, or token efficiency. Multi-metric reporting preserves the space of capabilities worth caring about and makes it harder to game the benchmark by sacrificing things that aren’t measured.

Maintenance is part of the artifact

A benchmark is not a paper that gets published and then sits on a shelf. It is a piece of infrastructure that the community uses to evaluate models for years after its release. In those years, the world changes around it. Models improve. Websites get redesigned. Bugs in the original tasks get discovered. If the benchmark team doesn’t keep up, the score gradually loses its initial value.

This is the principle hardest to appreciate from a single research paper because the maintenance work happens after the paper is published. The OSWorld team is the clearest example because they’ve been transparent about it.

OSWorld was first released in April 2024. The team had spent over 400 person-hours on quality checks before release and continued to invest hundreds more in the months after. Despite this, by mid-2025, they had collected approximately 300 issues from institutions running the benchmark: Moonshot AI, OpenAI, Anthropic, and others. These weren’t superficial bugs. Websites had changed their HTML structure, breaking evaluation functions. Booking sites had introduced CAPTCHAs that blocked agents. Some target venues for travel tasks had become unavailable. Time-sensitive tasks involving future dates had become invalid as those dates passed. Tasks with ambiguous instructions allowed multiple valid interpretations that the original evaluators marked as failures.

It took ten people two months to resolve these issues. They made an interesting design choice in the process: they primarily modified the evaluators rather than the tasks. The reasoning was that changing tasks would break score continuity with previous evaluations, while fixing evaluators preserved the ability to compare scores across versions. For example, they added more sophisticated document and image comparisons, added proxy support for sites with aggressive bot detection, and expanded the set of accepted answers rather than narrowing the instructions.

Their retrospective contains an underrated insight into benchmark design: “providing reliable rewards consumes more human resources than we imagined.” Even with 400+ person-hours of pre-release checking, 300 issues still surfaced. The fix wasn’t to check harder before release; it was to build infrastructure for ongoing maintenance.

HLE has its own maintenance story. After release, the team ran a community feedback bug bounty program through March 2025, which removed errors flagged by users. They also identified “potentially searchable” questions (questions a model with web search answered correctly, but the same model without search failed) and manually audited each one, removing those that could be found through web search. Despite this work, an audit of chemistry and biology samples later found that around 30% of them were potentially incorrect, suggesting that even active maintenance has limits when the questions require expertise that the maintainers don’t have.

OSWorld also made a structural argument that’s worth highlighting: decentralized evaluation doesn’t work for benchmarks like this. When everyone runs the benchmark on their own infrastructure, small differences in environment configuration accumulate, and people who find issues have no incentive to report them upstream. Worse, some teams modify tasks to fit their agents and then report scores as if they had run the standard benchmark. The OSWorld team’s response was to set up a centralized AWS-based evaluation platform where they run agents themselves and verify the results. This is more expensive than asking model developers to self-report, but it’s the only way to keep scores comparable over time.

IDE-Bench is too new to have a maintenance story, and its design reduces some of the surface area for the kinds of issues OSWorld faced. Their repositories are private and frozen, so they don’t change underneath the benchmark. The tasks are graded by test suites that the team controls, so there’s no anti-bot detection to create issues. However, the benchmark will face its own maintenance challenges over time. As models improve, the tasks that no current model can solve will get solved, and the team will need to extend the benchmark with harder tasks to preserve headroom. Whether they do that work is what will determine whether IDE-Bench remains useful in two years.

Conclusion

Capability benchmarks are how the field measures progress, and the design choices behind them shape what “progress” means. The five principles in this post appear across all three benchmarks because they map onto problems every benchmark designer has to solve, even when the solutions look nothing alike.

Contamination resistance comes first because a benchmark that has leaked into training data measures memorization rather than capability. The choice between private codebases (IDE-Bench), adversarial filtering (HLE), and reliance on a volatile environment (OSWorld) constrains what the benchmark can cover for the rest of its life.

The grading mechanism determines the scope of the benchmark. Exact match scales but limits questions to those with single correct answers. Test suites work for code but require correctness to be expressible as tests. Custom evaluators are the most flexible but the most fragile. There is no neutral choice here; the grading mechanism decides what kinds of capabilities the benchmark can measure.

Difficulty is calibrated against frontier models, or the benchmark saturates and stops distinguishing between models that are obviously different. The three approaches (adversarial filtering, real-task selection with frontier-model checking, and inheritance of difficulty from a complex environment) each have their own failure modes, and the right choice depends on what the benchmark is for.

A single number per model is no longer enough. Calibration error, failure taxonomies, token efficiency, and per-category performance all reveal what the headline score hides. The field is moving in this direction, not because researchers prefer complexity but because single-number reporting creates optimization pressure that degrades capabilities that aren’t being measured.

Maintenance is the principle most invisible from the outside and most consequential over time. A benchmark released and forgotten loses meaning within a year. A benchmark with active maintenance, transparent issue tracking, and a centralized evaluation infrastructure can stay useful for much longer.

The practical implication is that the number is downstream of all of these choices. Two benchmarks claiming to measure the same capability can produce very different rankings depending on how they handle contamination, grading, difficulty, reporting, and maintenance. Reading a benchmark paper carefully, especially the methodology and limitations sections, is the only way to know what a score actually means. The single number is the easy part to read. The design choices are what make it meaningful.