Traditional software is based on rules. For example, a programmer might write a rule for banking software: “if the transaction exceeds $10,000, flag it for review.” Machine learning is different: instead of writing rules, you give the system a set of example inputs along with the correct outputs it should produce, and it figures out the rules on its own. This set of examples and outputs is called the training data of a machine learning system.
Show an image classifier ten thousand photos labeled “cat” and ten thousand labeled “dog,” and it will learn to tell them apart without anyone specifying what a cat looks like. Show a fraud detection model millions of financial transactions labeled “legitimate” and “fraudulent,” and it learns patterns for identifying fraud that a human analyst might never think to codify. In both cases, the labeled photos and the labeled transactions are the training data for the machine learning algorithm.
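The fraud example can be sketched in a few lines of Python. This toy model uses a single feature (the transaction amount) and a handful of made-up labeled transactions; it exists only to show the shape of the idea: the threshold is learned from labeled examples rather than written by a programmer.

```python
# A minimal sketch of "learning a rule from examples" instead of writing it
# by hand. The data and the single-feature model are toy assumptions; real
# fraud models use many features and far more samples.

def learn_threshold(amounts, labels):
    """Pick the amount threshold that best separates the labeled examples."""
    best_threshold, best_correct = None, -1
    for candidate in sorted(set(amounts)):
        correct = sum(
            (amount >= candidate) == label
            for amount, label in zip(amounts, labels)
        )
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold

# Training data: transaction amounts labeled True (fraudulent) / False (legitimate).
amounts = [120, 800, 2500, 9000, 15000, 22000, 40000]
labels = [False, False, False, False, True, True, True]

threshold = learn_threshold(amounts, labels)
print(threshold)  # the learned rule: flag transactions at or above this amount
```

Nobody told the program that $15,000 matters; it recovered that boundary from the labeled examples, which is the whole trick.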
The quality and quantity of training data determine how well the system works. More samples of normal and fraudulent transactions in a training set for a fraud detection algorithm lead to better classification of future transactions, but if a lot of training samples are mislabeled or contain errors, then the system will perform worse.
This post focuses on the training of large language models (LLMs), which is where the most interesting dynamics around training data are playing out right now. LLMs require vastly more training data than other types of ML systems; the economics of acquiring this data are changing fast; and the way AI developers use training data has evolved into a complex multi-stage process.
When an AI developer like OpenAI or Google trains a new LLM, the first stage is pre-training. This is where the model processes an enormous amount of text and learns to imitate this human-written data.
The training data for pre-training can come from any part of the Internet: scientific papers, GitHub, Wikipedia, and news articles. Modern pre-training datasets contain trillions of words. For reference, the entirety of the English Wikipedia is roughly 5 billion words, so a large pre-training dataset might be equivalent to a thousand Wikipedias.
At this stage, the model learns to predict the next word in a text. For example, a training set might contain the sentence “The capital of the UK is London.” The LLM is shown the first part, “The capital of the UK is”, and asked to add a word to the end. Early in the training process, it might make a mistake and write “The capital of the UK is Paris”. In that case, the learning algorithm adjusts the internals of the LLM so that next time it is more likely to answer “London”. This process sounds trivial, and in the early days of LLMs, some people called them “glorified autocomplete” because that is exactly what they do: predict the next word in a text. But this training turned out to be extremely powerful, because to predict the next word well, the model has to learn grammar, facts about the world, logical reasoning, human behavior patterns, and much more.
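The next-word objective can be illustrated with a toy bigram model. Instead of a neural network, it simply counts which word follows which in a tiny made-up corpus, but the training signal is the same idea: context in, most likely next word out.

```python
# A toy sketch of next-word prediction using bigram counts. Real LLMs use
# neural networks over tokens, but the objective is the same: given a
# context, predict the next word. The corpus here is an illustrative assumption.
from collections import Counter, defaultdict

corpus = (
    "the capital of the uk is london . "
    "the capital of france is paris . "
    "the capital of the uk is london ."
).split()

# "Training": count which word follows each word in the corpus.
next_word_counts = defaultdict(Counter)
for current, following in zip(corpus, corpus[1:]):
    next_word_counts[current][following] += 1

def predict_next(word):
    """Predict the continuation seen most often in training."""
    return next_word_counts[word].most_common(1)[0][0]

print(predict_next("is"))  # "london" (seen twice, vs. "paris" once)
```

The model answers “london” not because anyone told it UK geography, but because that continuation was the most frequent one in its training data; scale this idea up by many orders of magnitude and you get an LLM.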
Not all text on the internet is equally useful. A well-written scientific paper teaches the model different things than a Reddit thread filled with low-effort arguments and tribalism. Labs invest significant effort in filtering and curating their pre-training data. This includes removing duplicate content, filtering out low-quality pages, balancing the mix between different domains (how much code vs. how much prose vs. how much scientific text), and handling sensitive or toxic content. It’s important to balance different types of sources: a model trained on too much code and not enough natural language will be great at programming and awkward at conversations.
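A heavily simplified sketch of this curation step follows. The thresholds and heuristics are illustrative assumptions; production pipelines use fuzzy deduplication, trained quality classifiers, and much more.

```python
# A minimal sketch of pre-training data curation: exact deduplication plus
# two crude quality heuristics. All cutoffs below are made-up for illustration.

def curate(documents, min_words=5):
    seen = set()
    kept = []
    for doc in documents:
        normalized = " ".join(doc.lower().split())
        if normalized in seen:
            continue  # drop exact duplicates
        seen.add(normalized)
        words = normalized.split()
        if len(words) < min_words:
            continue  # drop very short, likely low-quality pages
        if len(set(words)) / len(words) < 0.3:
            continue  # drop highly repetitive text
        kept.append(doc)
    return kept

docs = [
    "The mitochondria is the powerhouse of the cell.",
    "The mitochondria is the powerhouse of the cell.",    # duplicate
    "buy now",                                            # too short
    "spam spam spam spam spam spam spam spam spam spam",  # repetitive
]
print(curate(docs))  # keeps only the first document
```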
The main way to make LLMs more capable is to scale their size and the amount of training data, but there is a practical problem: we are running low on new high-quality text. The internet is large, but the portion of it that is well-written, factually accurate, and diverse enough to be useful for training is finite, and labs have already used most of it. Epoch AI projects that, on current trends, models will exhaust the stock of high-quality human-generated text within the next several years. One possible solution is synthetic data: using existing AI models to generate new training text. Synthetic data is promising, but it carries risks. Model-generated data is much less diverse than human-generated data: ask ChatGPT for a joke many times and it will repeat the same few jokes over and over, and training on such data makes the next model’s outputs even less diverse. And if a model is trained too heavily on outputs from other AI models, it can inherit their biases and errors, a failure mode researchers call model collapse. Where the next generation of high-quality pre-training data will come from is one of the most important open problems in the field.
A pre-trained model can complete any text, and its completions are often coherent and knowledgeable. But this capability by itself is not particularly useful. Ask it a question, and instead of answering, it might generate five more questions, because that’s what the next words in a text that contains a question often look like. It has absorbed a lot of knowledge about the world, but hasn’t learned to be an assistant or an agent.
Post-training is the process that turns this raw text predictor into something you can actually have a conversation with. Over the past several years, this stage has grown from a minor final step into a complex multi-stage pipeline that now rivals pre-training in effort and cost.
The simplest form of post-training is supervised fine-tuning. You show the model examples of the behavior you want: a question from a user paired with the response a helpful assistant would give, or an ambiguous instruction paired with a clarifying question instead of a guess. The model learns from these demonstrations the same way it learned during pre-training: by predicting the next word.
The training data for supervised fine-tuning is much smaller than for pre-training, and conversations for it are typically created by human contractors who are given detailed guidelines on how the ideal assistant should behave: be helpful, be honest, don’t make things up.
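One such contractor-written conversation might be stored roughly like this. The schema below is an illustrative assumption, not any particular lab’s actual format, though most labs use some chat-structured variant of it.

```python
# A sketch of what a single supervised fine-tuning example might look like
# as a chat-formatted record. The exact schema varies by lab; this structure
# is an illustrative assumption.
import json

sft_example = {
    "messages": [
        {"role": "user", "content": "What does HTTP status 404 mean?"},
        {
            "role": "assistant",
            "content": (
                "HTTP 404 means the server could not find the requested "
                "resource. Check the URL for typos or whether the page moved."
            ),
        },
    ]
}

# During fine-tuning, the model is trained to predict the assistant's words,
# token by token, just like in pre-training.
print(json.dumps(sft_example, indent=2))
```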
Supervised fine-tuning has a limitation. Writing perfect responses by hand is slow and expensive, and for many questions, there is no single perfect response. A different approach is to let the model generate several responses and then have humans choose the best one.
This is the idea behind reinforcement learning from human feedback, or RLHF. The process works roughly like this: the model generates two or more responses to the same prompt, and a human rater reads them and picks the better one. The training algorithm then adjusts the model so that it is more likely to produce the preferred response, and over time the model learns to produce responses that humans prefer. Models are infamously sycophantic, and this sycophancy largely stems from RLHF, because people prefer it when models agree with them.
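Under the hood, RLHF typically works by first training a reward model on these human preference pairs, then optimizing the LLM against it. A minimal sketch of that pairwise loss follows; the numeric scores are made-up placeholders standing in for a neural reward model’s outputs.

```python
# A sketch of the reward-modeling step inside RLHF: from pairwise human
# preferences, a reward model is trained so the chosen response scores higher
# than the rejected one. Scores here are placeholders, not real model outputs.
import math

def preference_loss(chosen_score, rejected_score):
    """Bradley-Terry style loss: small when chosen outscores rejected."""
    return -math.log(1 / (1 + math.exp(-(chosen_score - rejected_score))))

# A rater preferred response A (score 2.0) over response B (score 0.5):
good_fit = preference_loss(2.0, 0.5)  # reward model already agrees with the rater
bad_fit = preference_loss(0.5, 2.0)   # reward model disagrees -> larger loss

print(good_fit < bad_fit)  # True
```

Training pushes the scores in the direction that shrinks this loss, so the reward model gradually internalizes human preferences, and the LLM is then tuned to maximize that learned reward.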
RLHF is widely considered to be the key ingredient that made ChatGPT, which was released in November 2022, feel much smarter and more useful compared to earlier language models. Models before it were somewhat intelligent but erratic. They would sometimes produce thoughtful responses and sometimes produce nonsense or harmful responses.
The most recent development in post-training is reinforcement learning with automated feedback signals instead of human raters. The idea is to put the model in a situation where it can try things, observe the results, and learn whether it succeeded or failed. Behavior that led to success is reinforced, and the model behaves similarly in the future.
This approach is gaining traction because human feedback is expensive and slow to collect, and because for some tasks there are objective measures of success that don’t require human judgment. For example, if a model writes code, you can run the code and check whether it passes the automated tests. If a model solves a math problem, you can verify whether the answer is correct. At the same time, this stage is less suitable for training more subjective capabilities, like creative writing.
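A minimal sketch of such an automated feedback signal for code: run a candidate solution against tests and convert pass/fail into a reward. The candidate strings below stand in for model outputs; the task (an `add` function) is a toy assumption.

```python
# A sketch of an automated reward signal: execute candidate code against
# test cases and return 1.0 on full pass, 0.0 otherwise. The candidates
# below are stand-ins for model-generated code.

def reward(candidate_source, test_cases):
    """Return 1.0 if the candidate passes every test case, else 0.0."""
    namespace = {}
    try:
        exec(candidate_source, namespace)  # real systems run this in a sandbox!
        func = namespace["add"]
        passed = all(func(a, b) == expected for a, b, expected in test_cases)
        return 1.0 if passed else 0.0
    except Exception:
        return 0.0  # crashes and syntax errors earn no reward

tests = [(1, 2, 3), (0, 0, 0), (-1, 1, 0)]
correct = "def add(a, b):\n    return a + b"
buggy = "def add(a, b):\n    return a - b"

print(reward(correct, tests), reward(buggy, tests))  # 1.0 0.0
```

No human ever looks at the responses; the test suite is the rater, which is why this kind of feedback scales so much better than RLHF for verifiable tasks.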
For RL training to work, the model needs something to practice on. In the math case, this is straightforward: you need a large collection of math problems with verifiable answers. But for more complex tasks like software engineering, you need something more involved. You need an RL environment.
An RL environment is a simulated world where the model can take actions and receive feedback. For example, to train a model to fix bugs in real software, you need a codebase with a known bug. You need to run the model and the code on a computer, so the model can read the code, make changes, and run commands, the same way a human developer would. You need a test suite that can tell you whether the bug was actually fixed. And you need all of this running inside a sandbox, so the model can’t accidentally break anything outside the environment. For RL training to be effective, you need thousands of RL environments: thousands of realistic codebases, each with realistic bugs and reliable test suites.
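The pieces described above can be sketched as a toy environment with the reset/step interface common in RL frameworks. Here the “codebase” is one line, the “bug” is a single wrong constant, and the “test suite” is one check, which are all toy assumptions, but the shape matches real software-engineering environments.

```python
# A toy RL environment with the common reset()/step() convention. Real
# environments wrap full codebases, shells, and test suites inside a sandbox;
# this one compresses each of those pieces down to a single line.

class ToyBugFixEnv:
    def __init__(self):
        self.buggy_code = "TAX_RATE = 0.5  # bug: spec says the rate is 0.2"

    def reset(self):
        """Start a new episode and return the initial observation (the code)."""
        self.code = self.buggy_code
        return self.code

    def step(self, action):
        """Apply the agent's edit, run the 'test suite', and return a reward."""
        self.code = action
        namespace = {}
        exec(self.code, namespace)                 # run the edited code
        passed = namespace.get("TAX_RATE") == 0.2  # the one-check "test suite"
        done = True                                # single-step episodes
        return self.code, (1.0 if passed else 0.0), done

env = ToyBugFixEnv()
observation = env.reset()                  # agent sees the buggy code
_, reward, _ = env.step("TAX_RATE = 0.2")  # agent submits a fix
print(reward)  # 1.0 when the fix passes the test
```

During training, the model proposes edits, the environment scores them, and edits that pass the tests are reinforced, which is exactly the trial-and-error loop the paragraph above describes.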
The quality of the RL environment directly affects the quality of the resulting model. If your environment only contains toy problems with simple bugs and trivial tests, the model learns to solve toy problems. If your environment contains realistic, diverse software engineering challenges, the model learns skills that generalize. This is why building RL environments is becoming an important part of the AI training pipeline, and it’s what AfterQuery does: we build RL environments for software development.
Training data has become one of the central competitive dynamics in AI, and understanding it matters for anyone evaluating AI companies or investing in the space.
The most obvious dimension is data as a moat. For pre-training, the major labs have signed licensing deals worth hundreds of millions of dollars with organizations like Reddit and various news agencies. These deals exist because of the legal pressure around training on copyrighted content. The New York Times is suing OpenAI for using its articles for training, and the outcome of that lawsuit and similar ones could significantly affect what data is available for training.
But the more interesting shift is in what kind of data matters. In general, more pre-training data and bigger models produce better results, but current frontier models are already huge, and scaling pre-training datasets and model size has become more expensive than scaling post-training. As a result, frontier labs increasingly compete on post-training: who has the best fine-tuning data, the best human feedback pipelines, the best RL environments.
For investors who want to invest in startups providing training data, it’s important to understand that “training data” is not one thing to evaluate. The question is which stage of training the data is for, how defensible it is, and whether it’s the kind of data that is becoming more important or less important as the field evolves.
Training data for machine learning is the set of examples that a system learns from. For LLMs, this data comes in stages, and each stage serves a different purpose.
Pre-training gives the model broad knowledge by exposing it to trillions of words from the internet. Supervised fine-tuning teaches it to behave like an assistant. RLHF aligns its outputs with human preferences. RL training in specialized environments teaches it to solve complex tasks through trial and error.
The field is shifting. Pre-training drove the first wave of LLM progress, but labs are now seeing greater returns from post-training, particularly from RL environments that let models learn by doing. The bottleneck is moving from “do you have enough text from the internet” to “do you have high-quality environments and feedback signals for the skills you want the model to learn.”
For anyone trying to understand the AI landscape, the single most useful thing to take away is that training data is not a monolithic concept. The data that makes a model knowledgeable is different from the data that makes it helpful, which is different from the data that makes it good at writing code or solving math. Each type has its own economics, its own competitive dynamics, and its own trajectory.