
In November 2024, METR ran an experiment. They assigned identical sets of AI R&D problems to 61 machine learning engineers and a set of frontier AI agents. Examples of problems included writing a kernel or fine-tuning a model. The tasks were derived from real research work, with the humans and agents receiving time budgets of 2, 8, or 32 hours to work.
At 2 hours, the best AI agent scored four times as high as the average human engineer. The agents worked faster: they tried more approaches and wrote code more quickly than the engineers. But as the time budget grew, humans caught up. The crossover came around the 8-hour mark. At 32 hours, humans scored roughly twice as high as agents. This was because the agents ran out of ideas, looped back on broken approaches, and couldn't recover from wrong turns. The humans kept going.
This is the shape of the problem this post focuses on. Frontier AI models exceed humans on short-horizon tasks and are now moving to longer ones. METR tracks this using a chart that has become the reference point for AI progress: the length of tasks models can complete at a 50% success rate has roughly doubled every 6-7 months over the last 7 years.
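The arithmetic behind that doubling claim is easy to check. The sketch below assumes a clean exponential trend with a constant doubling time, which is a simplification of the real chart, and the function names are illustrative rather than anything METR publishes.

```python
import math


def growth_factor(years: float, doubling_months: float) -> float:
    """Multiplicative growth in task length after `years` of steady doubling."""
    doublings = years * 12 / doubling_months
    return 2 ** doublings


def months_to_reach(current_hours: float, target_hours: float,
                    doubling_months: float) -> float:
    """Months of steady doubling needed to go from one task length to another."""
    return doubling_months * math.log2(target_hours / current_hours)


for months in (6, 7):
    factor = growth_factor(7, months)
    print(f"doubling every {months} months for 7 years: x{factor:,.0f}")
# doubling every 6 months for 7 years: x16,384
# doubling every 7 months for 7 years: x4,096

# Hypothetical extrapolation: moving from 2-hour to 32-hour tasks is 4 doublings.
print(months_to_reach(2, 32, 6), months_to_reach(2, 32, 7))  # 24.0 28.0
```

Under these assumptions, seven years of doubling compounds to a task-length increase of somewhere between roughly 4,000x and 16,000x, and the gap between a 2-hour and a 32-hour task is only about two years of trend.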
The question is where this progress originates. The answer, increasingly, is that it comes from training data that only domain experts can produce. This data rarely appears online, and a single example can take a human 8 hours to generate. Building it requires people with deep experience in the field. The bottleneck for frontier capability is no longer compute or pre-training text. It is the supply of experts who can produce this data.
What the web actually contains
To see why expert data is the bottleneck, you must think about what pre-training data actually looks like. The web is large but shallow. It contains finished outputs, including code repositories, published papers, and Stack Overflow answers. But it does not contain the work that produced them.
Take a typical GitHub repository. The commit history shows you what code landed on the main branch. It does not show you the three hours the engineer spent staring at the stack trace, the two hypotheses they ruled out before finding the real cause, or the Slack thread where a colleague pointed out a subtle bug. The web preserves the final product, but discards the lengthy process.
The same pattern holds across domains. Published papers report results, not the failed experiments that preceded them. Medical charts record the diagnosis, not the differentials the clinician considered and rejected. In every case, what sits on the web is the compressed output, and what's missing is the trajectory that led to it.
This matters because AI agents are now being asked to do the work itself. To fix a bug in a production codebase, a model needs to debug like an engineer, not write a commit message like an engineer. To handle a novel medical case, a model needs to reason through differentials rather than recite the final diagnosis. The training signal for these capabilities has to come from the process, which the web does not have.
Ilya Sutskever made the broader version of this point in 2024. "Pre-training as we know it will unquestionably end," he said, "because we have but one internet."
The data wall and what comes after
Sutskever's point lands differently depending on how you read it. The weaker version most people hear is that we are running out of text. The crawlable web is finite, and the large AI developers have already ingested most of it. Epoch AI's projections estimate exhaustion of high-quality public text around 2028.
The stronger version is about the type of data that matters. Alexandr Wang, before Meta bought a large stake in Scale and moved him into Meta's superintelligence lab, framed it this way: "One of the bitter lessons of AI is that you are fundamentally bottlenecked by the data that you have to train these models on. We are fundamentally bottlenecked by the production of frontier data, as well as the production of these self-play environments for reinforcement-based learning."
Frontier data is the critical phrase here. He argues that frontier data sits at the edge of human ability and differs from the average data found on the internet. A million casual code samples from GitHub don't teach a model to debug a subtle AI training issue. A million Reddit threads don't teach a model to handle an ambiguous clinical case. For each capability a model requires, a type of data can teach it, and for the capabilities that matter at the frontier, that data is rarely found online.
This changes how training works. Pre-training still matters, but the final model is increasingly determined by what happens after pre-training. This includes supervised fine-tuning demonstrations, preference data, and especially the RL environments in which models learn. Each of these stages requires data that was produced deliberately by someone. And as the capabilities being trained move closer to the frontier of human performance, the people producing the data must keep up. Crowdworkers who cost $10 an hour cannot produce the trajectories that teach a model to work the way a senior engineer does.
DeepSeek-R1's accidental confession
One of the cleanest pieces of evidence that expert data is the bottleneck comes from DeepSeek. When they released the R1 model in early 2025, the headline result was that reasoning capabilities could emerge from pure reinforcement learning on math problems, without supervised demonstrations of reasoning. This was a genuine breakthrough: a model taught itself to think step-by-step by being rewarded for correct answers on math problems with verifiable solutions. The approach worked because math has a cheap verifier: you check the final answer against the ground truth.
Yet buried in the R1 technical report, and later in the paper DeepSeek published in Nature, there is an admission that reveals a lot about the current state of RL. The authors write that R1 has not demonstrated a substantial improvement over DeepSeek-V3 on software engineering benchmarks. They attribute this to the fact that large-scale RL has not been applied extensively to software engineering tasks.
The pattern is worth spelling out. RL on math works because a math problem is a short episode with a cheap verifier. You can automatically generate millions of math tasks. A model tries a solution, a script checks the answer, and the signal returns in minutes. RL on software engineering is different. A real engineering task takes several minutes or hours to attempt. The verifier is a test suite that has to be constructed by someone who knows what correct behavior looks like. If the tests miss the intended behavior, the signal is incorrect.
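The contrast between the two verifiers can be made concrete with a toy sketch. Everything here is illustrative and assumed for the example: the reward functions, the `buggy_max` candidate, and both test suites are hypothetical, and real RL pipelines are far more involved. The point is only that the math check needs nothing but a known answer, while the code check is exactly as good as the tests someone wrote.

```python
from fractions import Fraction
from typing import Callable


def math_reward(model_answer: str, ground_truth: str) -> float:
    """Cheap verifier: compare the final answer to a known solution."""
    try:
        same = Fraction(model_answer.strip()) == Fraction(ground_truth.strip())
    except (ValueError, ZeroDivisionError):
        return 0.0  # unparseable answers earn nothing
    return 1.0 if same else 0.0


def code_reward(candidate: Callable[[list[int]], int],
                tests: list[tuple[list[int], int]]) -> float:
    """Test-suite verifier: reward 1.0 only if every test case passes."""
    for args, expected in tests:
        try:
            if candidate(args) != expected:
                return 0.0
        except Exception:
            return 0.0
    return 1.0


def buggy_max(xs: list[int]) -> int:
    """Hypothetical model patch: wrong whenever every element is negative."""
    best = 0
    for x in xs:
        if x > best:
            best = x
    return best


# A suite written without thinking about negative inputs rewards the bug.
weak_tests = [([1, 2, 3], 3), ([5, 1], 5)]
# A suite written by someone who knows the intended behavior catches it.
strong_tests = weak_tests + [([-3, -1], -1)]

print(math_reward("1/2", "0.5"))             # 1.0
print(code_reward(buggy_max, weak_tests))    # 1.0 (incorrect signal)
print(code_reward(buggy_max, strong_tests))  # 0.0
```

With the weak suite, the model is rewarded for a patch a reviewer would reject, which is the failure mode described above.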
So the same technique that produced superhuman math reasoning in a few months of training didn't move the needle on software engineering, because software engineering doesn't have the cheap, verifiable tasks that math has. The only way to close the gap is to build environments in which the tasks, tests, and grading infrastructure are designed by people knowledgeable in the domain.
Why experts, specifically
You can generate millions of candidate SWE tasks synthetically, but someone must always review them. They have to check whether the tasks are well specified, whether they actually measure the intended behavior, and whether the graded solutions are correct. As models improve, the pool of people who can do this reviewing shrinks.
OpenAI faced this issue when developing CriticGPT, a model trained to catch bugs in code written by ChatGPT. Their original plan was to have crowdworkers flag bugs in model-generated code and use those flags as training data. The problem was that ChatGPT's code is usually correct, and when it's wrong, the bugs are often subtle. The crowdworkers weren't catching enough bugs to produce a useful training signal. OpenAI resorted to manually inserting bugs into the code and having the crowdworkers write critiques as if they had found them. Their paper admits the limitation directly: "the distribution of inserted bugs is quite different from the distribution of natural LLM errors." They also frame the broader problem: "as we advance in reasoning and model behavior, ChatGPT becomes more accurate, and its mistakes become more subtle. This can make it hard for humans to spot inaccuracies when they do occur, making the comparison task using RLHF much harder."
This is the verification ceiling. Grading a model's output requires people who can distinguish between strong and weak outputs. For most of machine learning's history, this was a solved problem. The work was image labels, text quality ratings, and preference comparisons between short responses, and the grader just had to be literate and attentive. Now, models routinely produce outputs that look plausible even when they're wrong. The grader must be able to tell the difference, and for frontier tasks, that means being able to do the task themselves.
What this looks like in biosecurity
The verification ceiling is sharpest in domains that involve physical work, not just intellectual work. Biosecurity is the clearest example.
In 2025, the Frontier Model Forum funded a study by the non-profit Active Site in partnership with Sentinel Bio. They recruited 153 individuals with limited biological lab experience and asked them to complete a set of core biology tasks. Roughly half of them had access to frontier AI models. The rest could only use the internet. Before the study, expert forecasters predicted that roughly 27% of the AI-assisted group would succeed, against 12% of the internet-only group.
The results came in far below those forecasts. About 6.6% of the internet-only group completed all three core tasks. The AI-assisted group did slightly worse, at 5.2%. Even with the best models available, most participants failed, and access to a frontier model did not make a measurable difference. The bottleneck wasn't information. The participants had access to protocols, tutorials, and AI models that could explain each step. What they lacked was the tacit skill that comes from actual lab experience: recognizing what a healthy cell culture looks like, spotting early signs of contamination, and handling a pipette without drift.
Kevin Esvelt, a biologist at MIT who studies biosecurity risks, put it plainly: "You need to be able to culture mammalian cells. And that is a form of tacit knowledge barrier, because until you've been trained in it, it's just really hard to pick it up yourself without contaminating everything."
The same gap appears in a benchmark called the Virology Capabilities Test, built by the group SecureBio. They asked PhD virologists to answer questions about their own sub-specialties. The virologists scored an average of about 22%. By contrast, OpenAI's o3 model scored 44%, far better on paper than the experts. Notably, the test was built specifically around knowledge acquired in lab meetings and at the bench, not textbook information.
Reading the two results together
The two findings appear to contradict each other. The wet-lab study shows that models barely help novices complete real tasks. The virology test shows that models outscore PhD virologists on questions about their own sub-specialties. Which one is right?
Both are, and the gap between them is the whole point. The virology test measures what a model knows. The wet-lab study measures what a person can do with that knowledge. A model that has read every virology paper ever published can answer a multiple-choice question about cell culture contamination. It cannot tell a novice in real time that the slight cloudiness in their flask is bacterial growth and not normal cell debris. This is because it cannot see the flask, and because the relevant skill is visual pattern recognition built from hundreds of hours of lab experience.
The same gap appears in software engineering. A model that has read every line of code on GitHub can produce a plausible patch for a bug. Whether that patch is actually correct, handles the edge cases the maintainer cares about, or fits the architecture of the codebase is a different question. The answer depends on the kind of judgment that senior engineers develop over years.
METR published a result in March 2026 quantifying this for software. They reviewed pull requests that AI agents produced for the SWE-bench Verified benchmark, which uses automated tests to grade agents' fixes for bugs in open-source repositories. The agents' patches passed the tests. When METR showed those same patches to the actual maintainers of the repositories, roughly half of the patches that passed the tests would not have been merged. The maintainers rejected them for reasons the tests couldn't see, including incorrect architectural choices, bad style, and fragile assumptions.
This follows the same pattern as the virology test. The model can produce outputs that look right to an automated grader. Whether the outputs are right to someone who actually does the work is a separate question.
What this means for training
The pattern from these studies has a direct consequence for how frontier models get trained. If the goal is to build a model capable of expert work, the training data must come from people capable of doing expert work. Not because experts are the only ones who can produce correct outputs, but because they are the only ones who can tell correct outputs from plausible-looking wrong ones.
What the market already shows
If expert data is the bottleneck, you would expect the market to reflect it, and it does.
The clearest signal came in June 2025, when Meta paid $14.3 billion for a 49% stake in Scale AI, the largest data-labeling company in the industry. The deal valued Scale at roughly $29 billion, and Alexandr Wang, Scale's founder, moved to Meta to lead its superintelligence effort. A company whose main product is human-produced training data became one of the most valuable AI companies in the world, and the buyer was a frontier lab that has no trouble scraping the web on its own.
Surge AI, a competitor that focuses on higher-end expert labeling, was reportedly in talks at a $25 billion valuation in mid-2025, on roughly $1.2 billion in revenue. Surge is bootstrapped and has around 120 employees. Mercor, a startup that hires domain experts on demand for AI labs, received a $10 billion valuation in October 2025 on $500 million in annual revenue. Mercor reports paying its contractors an hourly average of $85, with physicians earning around $200.
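The figures above are easier to compare once reduced to simple ratios. The sketch below is back-of-the-envelope arithmetic on the numbers quoted in this section, not a valuation model; the helper names are illustrative, and the inputs are the reported figures, some of which are themselves approximate.

```python
def implied_valuation(price_bn: float, stake: float) -> float:
    """Whole-company valuation implied by paying `price_bn` for a `stake` fraction."""
    return price_bn / stake


def revenue_multiple(valuation_bn: float, revenue_bn: float) -> float:
    """Valuation expressed as a multiple of annual revenue."""
    return valuation_bn / revenue_bn


# Reported figures from the text, in billions of dollars.
print(f"Scale implied valuation: ${implied_valuation(14.3, 0.49):.1f}B")  # $29.2B
print(f"Surge revenue multiple:  {revenue_multiple(25, 1.2):.1f}x")       # 20.8x
print(f"Mercor revenue multiple: {revenue_multiple(10, 0.5):.1f}x")       # 20.0x
```

On these numbers, both Surge and Mercor were being valued at roughly twenty times revenue, and the $29 billion figure for Scale follows directly from the size and price of Meta's stake.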
Another company, Mechanize, which builds RL environments for agentic tasks, has been reported to offer $500,000 salaries to engineers who can design those environments well. This is not the price of data labeling. This is the price of the judgment required to decide what a good environment looks like.
A TechCrunch report from September 2025 cited discussions at Anthropic about spending more than $1 billion on RL environments in the following year. OpenAI's data budget is projected to grow from around $1 billion in 2025 to roughly $8 billion by 2030. Jennifer Li, a partner at Andreessen Horowitz, put it clearly: "All the big AI labs are building RL environments in-house. But creating these datasets is very complex, so AI labs are also looking at third-party vendors. Everyone is looking at this space."
Conclusion
The story of training data over the past several years has been one of moving up the stack. In 2020, the question was whether you could scrape enough text from the web. In 2023, the question was whether you could filter and curate text well enough to train a useful base model. Both of those problems are now largely solved. The frontier has moved to what follows pre-training, and that depends on data deliberately produced by humans.
The METR chart at the start of this post shows exponential progress on long-horizon tasks. That progress isn't coming from better scraping. It is coming from RL environments built by people who know what good engineering work looks like, from rubrics written by physicians who know what good medical reasoning looks like, and from verifiers designed by people who can tell a plausible-looking wrong answer from a correct one. The curve continues because labs find new ways to produce this data at a greater scale. It will only flatten when they run out of ways.
For anyone trying to understand the AI landscape, this reframes the usual questions. The question is not how much compute an AI developer has, because the leading labs all have enough. It is not how much pre-training text they have access to, because the legal and technical problems around that are well understood. The question is whether the lab can produce or purchase the kind of expert data that moves the frontier for the capabilities they care about.
The market has already priced this in. The open question is who can actually produce the data. Scraping was a scale problem. Curation was a filtering problem. Expert data is a people problem, and knowledgeable people are hard to scale.