The role of data in AI training is changing. One might assume that human data is losing its prominence as synthetic data replaces it and training pipelines become automated.
This assumption misses something important. Human contribution to AI training is not disappearing; it is evolving. The routine parts are being automated, and the human input that remains matters more for frontier capabilities. The parts that still require people are the ones that demand genuine expertise and judgment, and they are becoming the bottleneck that determines how capable a model actually is.
To understand why, you need to know that training a modern large language model happens in stages, and each stage uses a different kind of data. Pre-training teaches models to imitate human-written text by exposing them to trillions of words from the internet. Preference training, which includes approaches like RLHF and Constitutional AI, improves the model based on which of its responses people, or an AI, prefer. Supervised fine-tuning (SFT) shows the model examples of how a helpful assistant should behave. Reinforcement learning (RL) puts the model in environments where it can practice tasks and learn by doing. We wrote a detailed explanation of these stages in another post. Here we’ll focus on the human element at each stage and why it matters so much.
The popular image of pre-training is simple: scrape the internet and feed it to the model. In reality, the human judgment involved in curating pre-training data is a crucial part of building an LLM.
Not all text on the internet is equally useful. A well-sourced scientific paper teaches a model different things than a Reddit thread full of low-effort arguments. A Wikipedia article with numerous references is more valuable than an LLM-written content farm page stuffed with keywords. The internet contains trillions of words, but the portion that is well-written, factually accurate, and diverse enough to train a capable model is a small fraction of it. Someone has to decide what gets in, what gets filtered out, and how to balance different types of content against each other. How much code versus how much natural language? How much scientific text versus how much casual conversation? These decisions determine the resulting model’s strengths and weaknesses, and they require human judgment that is difficult to fully automate.
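To make those decisions concrete, here is a minimal sketch of the two levers a curation pipeline pulls: a quality filter and a source mixture. The heuristics and mixture weights below are invented for illustration; real pipelines use trained quality classifiers, deduplication, and carefully tuned mixtures, but the shape of the human judgment is the same.

```python
import random

# Hypothetical quality heuristics, illustrative only.
def passes_quality_filter(doc: str) -> bool:
    words = doc.split()
    if len(words) < 20:                      # too short to teach anything
        return False
    if len(set(words)) / len(words) < 0.3:   # highly repetitive text
        return False
    return True

# Mixture weights are a human judgment call: how much code vs. prose,
# science vs. conversation. These numbers are made up.
MIXTURE = {"code": 0.25, "scientific": 0.15, "encyclopedic": 0.20, "web": 0.40}

def sample_source(rng: random.Random) -> str:
    """Pick which corpus the next training document comes from."""
    r, acc = rng.random(), 0.0
    for source, weight in MIXTURE.items():
        acc += weight
        if r < acc:
            return source
    return "web"
```

Changing `MIXTURE` by a few percentage points shifts what the resulting model is good at, which is exactly why these numbers are a matter of judgment rather than automation.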
The clearest demonstration of how much curation matters comes from Microsoft’s Phi series of models, which are surprisingly capable for their size. The Phi team showed that a small model trained on carefully selected “textbook quality” data could outperform much larger models trained on bigger datasets with less careful curation. The Phi models had fewer parameters and saw less data, but the data they saw was chosen with far more care. This result made a simple point concrete: a human who is good at selecting training data is a multiplier on everything that happens downstream in training.
The economics of pre-training data are shifting in an interesting way. Frontier AI developers understand the technical challenge of assembling a pre-training dataset better and better: they know how to scrape it and how to filter it. What is getting harder is the legal environment. AI developers have signed licensing deals worth hundreds of millions of dollars with organizations like Reddit and various news publishers. xAI went further and acquired X, the entire social network. The technical cost of curating pre-training data is going down, but the legal cost of the right to use high-quality text is going up. For pre-training, the emerging moat is legal access to premium content.
After pre-training, a model has absorbed enormous amounts of knowledge, but it doesn’t know how to be a useful assistant.
Supervised fine-tuning (SFT) solves this by showing the model examples of the behavior you want: a user asks something, and a helpful assistant responds clearly. The model learns from these demonstrations in much the same way it learned during pre-training, but now it is learning how to be an assistant, not just a human imitator.
What’s surprising is how little data this requires compared to pre-training. While pre-training consumes trillions of words, SFT might use just thousands of examples. But those examples need to be good. The quality of each individual demonstration matters far more than the total count, because each one is teaching the model what being helpful looks like.
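The shape of this data is simple. Below is a hypothetical SFT example in a generic chat format, together with a sketch of how it might be flattened into the text the model is fine-tuned on. The format and the `<|role|>` markers are invented for illustration; every lab uses its own template, but each example pairs a request with the response the model should imitate.

```python
# A hypothetical SFT demonstration. The content of each example is where
# human taste enters: someone decided this is what "helpful" looks like.
demonstration = {
    "messages": [
        {"role": "user", "content": "Explain recursion in one paragraph."},
        {"role": "assistant", "content": "Recursion is when a function solves "
         "a problem by calling itself on a smaller version of the problem..."},
    ]
}

def to_training_text(example: dict) -> str:
    """Flatten a demonstration into fine-tuning text.
    In practice, loss is typically computed only on the assistant turns."""
    parts = []
    for msg in example["messages"]:
        parts.append(f"<|{msg['role']}|>\n{msg['content']}")
    return "\n".join(parts)
```

With only thousands of such examples in a typical SFT set, a single badly written demonstration has far more influence than a single bad document in a trillion-word pre-training corpus.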
The most vivid recent illustration of this comes from DeepSeek R1, the Chinese model that shook markets in early 2025. DeepSeek first trained a version called R1-Zero using pre-training and reinforcement learning. R1-Zero developed impressive reasoning capabilities, matching OpenAI’s o1 on some benchmarks. But it had a problem: its outputs were messy, it occasionally mixed languages unpredictably, and sometimes produced unreadable text. The model could reason but struggled to communicate.
To fix this, DeepSeek created R1 by fine-tuning their base model on just thousands of carefully selected examples. This cold-start data was small in volume but high in quality: the team collected readable reasoning traces, filtered out poorly formatted outputs, and refined results with human annotators. The effect was significant. R1 kept the reasoning power of R1-Zero while producing outputs that humans could actually follow and use. Thousands of well-chosen examples turned a brilliant but incoherent system into a product that competed with the best models in the world.
The lesson for understanding AI training is that supervised fine-tuning is a stage where human taste and judgment have disproportionate leverage compared to the amount of data involved. The people writing and curating these demonstrations are making choices about what good output looks like, and those choices affect every response the model produces afterward.
The most recent development in how frontier models are trained is reinforcement learning with automated feedback. Instead of showing the model examples of good behavior, you put it in a situation where it can try things, observe results, and learn whether it succeeded or failed. If a model writes code, you can run the code and check if it passes tests. If it solves a math problem, you can verify the answer. The model gets a signal: that worked, or that didn’t. Over many iterations, it gets better.
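For code, that success-or-failure signal can be computed mechanically. Here is a minimal sketch of such an automated reward, assuming the model’s output is a Python solution and the verifier is a test snippet; real pipelines sandbox execution and score partial credit, but the signal has this shape.

```python
import os
import subprocess
import sys
import tempfile

def code_reward(solution: str, test_code: str) -> float:
    """Run model-written code against a test snippet.
    Returns 1.0 if the tests pass, 0.0 otherwise. A minimal sketch:
    real RL pipelines isolate execution far more carefully."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "check.py")
        with open(path, "w") as f:
            f.write(solution + "\n" + test_code)
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=10)
        return 1.0 if result.returncode == 0 else 0.0
```

For example, `code_reward("def add(a, b):\n    return a + b", "assert add(2, 2) == 4")` yields a reward of 1.0, while a buggy solution yields 0.0, and no human ever has to look at either attempt.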
This approach is producing rapid gains. METR, an AI safety organization, tracks how well models can accomplish software engineering tasks of increasing complexity, measured by how long the tasks take humans to complete. Their now-famous plot shows that the length of tasks models can handle has been growing exponentially, roughly doubling every few months. A couple of years ago, frontier models could only handle tasks a few minutes long; current models can work on genuine engineering problems that take hours, and the majority of this improvement comes from RL.
RL training depends on expert human data. It needs an environment: a simulated world with long, complex, realistic tasks and reliable feedback signals. In software engineering, that means a real codebase on a computer and a test suite that reliably tells you whether the code actually works. And for RL training to be effective, one such environment is not enough. You need thousands of them, each with different tasks and different levels of difficulty.
Building these environments is where human expertise becomes critical in a new way. The people designing RL environments for software engineering need to understand what real tasks look like, what makes a test suite reliable, and what constitutes a realistic set of tasks that teaches the model real software engineering skills. If your environments only contain simple bugs with obvious fixes, the model learns to solve simple problems. If they contain the kind of messy, ambiguous challenges that real software engineers face, the model develops skills that generalize. The quality of the environment directly determines the quality of the resulting model.
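Structurally, such an environment exposes a small interface to the learning algorithm. The sketch below shows that interface with a trivial stand-in for the codebase and test suite; `SWEnv` and its fields are hypothetical names for illustration, not any lab’s actual API. All of the human expertise lives in what fills the two fields: the task description and the hidden tests.

```python
from dataclasses import dataclass

@dataclass
class SWEnv:
    """Minimal sketch of an RL environment for software tasks.
    Real environments wrap a full repository and test suite; here the
    'test suite' is just a predicate over the agent's submission."""
    task_description: str
    hidden_tests: callable          # returns True if the submission passes
    steps_taken: int = 0
    max_steps: int = 10

    def reset(self) -> str:
        self.steps_taken = 0
        return self.task_description          # initial observation

    def step(self, action: str):
        """Agent submits an attempt; returns (observation, reward, done)."""
        self.steps_taken += 1
        passed = self.hidden_tests(action)
        done = passed or self.steps_taken >= self.max_steps
        return ("tests passed" if passed else "tests failed",
                1.0 if passed else 0.0,
                done)
```

A model trained against thousands of these, with realistic tasks and trustworthy tests, is practicing real engineering; a model trained against toy versions is practicing toys.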
This is the part of the training pipeline where AfterQuery works. We build RL environments for software development, providing frontier AI developers with realistic, diverse, and well-validated training grounds that their models need to keep pushing the curve on the METR chart upward.
Some people might say that synthetic data is making human input obsolete. AI models are increasingly generating training data for other AI models, and in some cases for themselves. Anthropic’s Constitutional AI, their version of preference training, is a good example: instead of hiring thousands of human raters to judge which model response they like more, Anthropic wrote a set of principles (the “constitution”) and let the AI rate its own outputs by testing them against the constitution. This dramatically reduced the need for human raters compared to other preference training approaches like RLHF. But even though the process is automated, the human input didn’t disappear; it moved to a higher level. Instead of rating individual responses, as other companies do, humans designed the principles by which an AI rates the responses. The routine labor was automated, while humans still designed the whole process.
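The shape of that loop can be sketched as follows. The two-principle constitution below is invented, and `model` stands in for a real LLM call; the point is only the structure: humans write the principles once, and the AI applies them to rank responses at scale.

```python
# Illustrative constitution; real ones contain many carefully worded principles.
CONSTITUTION = [
    "Choose the response that is more helpful and honest.",
    "Choose the response less likely to encourage harm.",
]

def ai_preference(model, prompt: str, response_a: str, response_b: str) -> str:
    """Ask the model itself which response better follows the principles.
    `model` is any callable mapping a prompt string to an answer string."""
    votes = {"A": 0, "B": 0}
    for principle in CONSTITUTION:
        judgment = model(
            f"Principle: {principle}\nPrompt: {prompt}\n"
            f"Response A: {response_a}\nResponse B: {response_b}\n"
            f"Which response better follows the principle? Answer A or B."
        )
        votes[judgment.strip()[:1]] += 1
    return "A" if votes["A"] >= votes["B"] else "B"
```

Every preference label this loop produces is automated, yet each one traces back to a principle a human wrote.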
Synthetic data has a deeper limitation, though. Model-generated text is less diverse than human-generated text, and this compounds as models are trained on the outputs of models that are also trained this way. Researchers have studied what happens when you train a model on the outputs of a previous model, and then train the next model on that model’s outputs, repeating across multiple generations. Each generation loses more of the unusual and rare content that exists in human-written text. After several generations, the output converges toward a bland, repetitive mean, and performance degrades: a phenomenon researchers call model collapse. One study demonstrated this with image generation: models trained on AI-generated faces gradually lost the ability to produce faces of underrepresented ethnicities, converging on a narrow set of features from the majority of training examples. Synthetic data has its uses, but it cannot replace the diversity that human-generated data provides.
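This dynamic can be reproduced in miniature. In the toy simulation below, each “generation” is trained on the previous generation’s outputs by resampling from them; because a rare item that fails to be sampled can never come back, diversity only shrinks. It is a deliberately simplified analogue of the studies described above, not a model of any real training run.

```python
import random

def simulate_collapse(vocab_size: int = 100, n_samples: int = 60,
                      n_generations: int = 20, seed: int = 0) -> list:
    """Toy model of collapse: each generation samples only from the
    previous generation's outputs, so the set of surviving items can
    never grow. Returns the number of distinct items per generation."""
    rng = random.Random(seed)
    data = [rng.randrange(vocab_size) for _ in range(n_samples)]
    diversity = [len(set(data))]
    for _ in range(n_generations):
        data = [rng.choice(data) for _ in range(n_samples)]  # train on own outputs
        diversity.append(len(set(data)))
    return diversity
```

Running this, the diversity count is non-increasing by construction, and the rare items are the first to vanish, mirroring how rare human-written content disappears from model-trained-on-model pipelines.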
The parts of AI training that are easiest to automate are getting automated. Automated reward signals can tell a model whether its code compiles without any human in the loop.
What remains is the work that requires judgment. Deciding which data belongs in a pre-training corpus. Writing the demonstrations that teach a model how a good assistant responds. Designing RL environments that reflect the complexity of real-world tasks rather than presenting the model with toy problems.
This has a concrete implication for anyone evaluating companies that provide training data to AI developers. “Training data” is not one market. It is several markets with fundamentally different dynamics, and the direction each market is heading matters more than its current size.
Pre-training data is becoming well understood technically. The hard problems are legal, not technical: securing the right to use high-quality text while copyright litigation intensifies and licensing deals grow more expensive. Companies in this space are building moats out of content access and legal agreements, not out of technical expertise that competitors can’t replicate.
Other stages of training are moving in the opposite direction. The technical challenges are growing. Building thousands of high-quality RL environments for software engineering is a genuinely hard problem that requires deep domain knowledge, and the AI developers’ appetite for this kind of data is increasing as they push harder on RL training. The moat here is expertise: understanding what makes a training environment realistic, diverse, and reliable enough to produce models that perform well on real tasks.
The broad trend is that as models need to become more and more capable, human contribution to AI training is concentrating into higher-skill, higher-leverage roles. The value created by the people who remain is increasing, because they are doing the work that the models themselves cannot yet do.
Training data for AI is not becoming less human. It is becoming human in a different way. The volume of human labor in the pipeline may decrease as synthetic data and automated feedback take over routine tasks. But the value of the human work that remains is growing, because it sits at the points in the pipeline where judgment, expertise, and taste determine whether a model is merely functional or genuinely capable. For anyone trying to understand where the AI industry is heading, this shift matters: the companies that will be hardest to replicate are not the ones sitting on the largest piles of data, but the ones whose people know how to create the data that models cannot generate for themselves.