Dataset Curation for Machine Learning

Every Machine Learning (ML) system learns from data. How good an ML system can become depends on the quality of the data it learns from. Engineers must decide what to include and what to filter out, how to label the data, and how to balance different types of data against each other. This discipline is known as dataset curation, and it's one of the most important and least visible parts of building ML systems.
The common assumption in ML is that more data produces better models. The famous Large Language Model (LLM) scaling laws, for example, suggest that model capabilities improve with larger models, more data, and longer training. This is generally true, but there is considerable complexity behind the claim. In many scenarios, a small, carefully curated dataset can outperform a large, loosely filtered one. The decisions that go into data curation often matter more than the model's architecture or the raw volume of available data. These decisions are where companies that build ML systems differentiate themselves, and they are one reason such companies are hard to evaluate from the outside.
This post explains how dataset curation works across different domains of machine learning. We'll begin with computer vision, then move deeper into LLM pre-training data and Reinforcement Learning (RL) environments, where the curation challenges are more complex and more relevant today.
Computer vision: where the field learned that curation matters
Building a computer vision dataset may seem straightforward: you collect images, assign labels that describe the images, and train a model to accurately label them. In practice, each step hides decisions that shape the resulting system in ways that aren't evident until something goes wrong.
Start with collection. Images can come from internet scraping, which gives you scale but limited control. Search engines reflect the biases of who uploads photos and what gets surfaced. A face dataset built this way overrepresents demographics that are more active online and underrepresents others, which directly affects the model's performance on underrepresented groups. Images can also be collected in a more controlled manner: a self-driving car company might mount cameras on vehicles and drive long distances. This produces data that closely matches deployment conditions, but it's expensive and geographically constrained. A fleet operating in a warm climate such as California will produce a model that struggles with snow. A third option is to pull from existing curated databases or licensed photo collections, which gives you quality but less diversity than the open internet provides.
Next comes labeling. Annotators have to look at each image and assign it a category. For millions of images across thousands of categories, this requires massive human effort. The standard approach has been crowdsourcing on platforms like Amazon Mechanical Turk, where workers classify images for a small payment per task. This is cheap and fast, but workers are not domain experts. Anyone can tell a cat from a dog, but distinguishing between two similar bird species or identifying a rare skin condition in a medical image requires expertise.
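The usual defense against non-expert labels is redundancy: collect several independent votes per image and only accept a label when enough annotators agree. A minimal sketch of that aggregation step, with a hypothetical `aggregate_labels` helper and an illustrative 60% agreement threshold (real pipelines tune the vote count and threshold per category):

```python
from collections import Counter

def aggregate_labels(votes, min_agreement=0.6):
    """Majority-vote aggregation for crowdsourced labels.

    votes: list of label strings from independent annotators.
    Returns the winning label, or None if agreement is below the
    threshold (the image would then be sent out for more annotation).
    """
    counts = Counter(votes)
    label, top = counts.most_common(1)[0]
    if top / len(votes) >= min_agreement:
        return label
    return None

print(aggregate_labels(["cat", "cat", "cat", "dog"]))            # → cat
print(aggregate_labels(["sparrow", "finch", "sparrow", "finch"]))  # → None
```

Note what the second call illustrates: for fine-grained categories like bird species, non-expert votes often split evenly, so redundancy alone cannot rescue labels that require real expertise.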
ImageNet is the dataset that best illustrates both the power and the pitfalls of this process. Built at Princeton University in the late 2000s, ImageNet organized millions of images into thousands of categories through a massive crowdsourcing effort involving 49,000 Mechanical Turk workers from 167 countries, who classified over 160 million candidate images. Each image was labeled multiple times, and the team built quality control systems to determine how many raters needed to agree before a label was accepted. ImageNet became the most important dataset in the history of deep learning. In 2012, during the ImageNet Large Scale Visual Recognition Challenge, an annual competition for evaluating computer vision models, something remarkable happened: AlexNet demonstrated that deep neural networks could dramatically outperform traditional computer vision methods, and that result kicked off the deep learning revolution.
But ImageNet also became the best case study for how curation problems persist. In 2021, Northcutt et al. published a study examining label errors across several commonly used ML benchmark datasets, including ImageNet. They found that roughly 6% of the ImageNet validation set was mislabeled. These were not random slips but systematic mistakes: ambiguous categories that crowd workers disagreed on, fine-grained distinctions (such as between similar dog breeds) that non-experts couldn't reliably make, and images that could belong to multiple categories but were assigned only one.
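Northcutt et al. used a technique called confident learning to find these errors. A heavily simplified sketch of the underlying intuition (not their exact algorithm): if a trained model assigns very high probability to a class that disagrees with the human label, that example is worth sending back for review. The function name and threshold below are illustrative.

```python
def flag_suspect_labels(probs, given_labels, threshold=0.9):
    """Flag examples whose model prediction strongly disagrees with
    the assigned label -- a simplified stand-in for confident learning.

    probs: list of dicts mapping class name -> predicted probability.
    given_labels: the labels assigned by annotators.
    Returns indices of examples worth re-reviewing.
    """
    suspects = []
    for i, (p, label) in enumerate(zip(probs, given_labels)):
        predicted = max(p, key=p.get)
        if predicted != label and p[predicted] >= threshold:
            suspects.append(i)
    return suspects

probs = [
    {"husky": 0.95, "malamute": 0.05},  # model is confident it's a husky
    {"husky": 0.55, "malamute": 0.45},  # genuinely ambiguous, left alone
]
print(flag_suspect_labels(probs, ["malamute", "malamute"]))  # → [0]
```

The real method is more careful, estimating per-class confidence thresholds rather than a single global one, precisely so that systematically confusable classes (like similar dog breeds) don't flood the review queue.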
When researchers corrected these labels and re-evaluated, the model rankings changed. Researchers had been optimizing their models to fit noise in the labels, not just the actual visual patterns. For years, some of the apparent progress in image recognition was due to labeling errors that went unnoticed.
Ultimately, data quality sets an invisible ceiling on model performance. The field spent years building increasingly sophisticated architectures while the dataset they were trained on contained errors that capped what those models could learn. This notion now applies with even more force to LLMs, where the data is orders of magnitude larger and curation decisions are harder to audit.
Pre-training data curation for LLMs
LLM pre-training curation faces the same fundamental problem as vision curation: quality matters more than volume. The mechanics, though, are different. Instead of labeling images, you're filtering and balancing massive text corpora.
The raw material for many pre-training datasets is Common Crawl, a publicly available archive of the web containing hundreds of billions of pages. Most of this content is useless for training, and turning Common Crawl into a usable training dataset requires multiple stages of filtering. The first stage is heuristic filtering: simple rules that remove obviously low-quality documents. Pages that are too short, pages with abnormal word distributions (suggesting spam or machine-generated text), pages with excessive repetition, and pages in languages you don't want are all removed. These filters are crude, but they quickly remove a large fraction of unusable data.
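Heuristic filters of this kind are easy to express in code. A minimal sketch, with illustrative thresholds rather than the values any specific pipeline uses:

```python
def passes_heuristics(doc, min_words=50, max_repetition=0.3):
    """Crude heuristic filters of the kind applied to raw web text.
    Returns True if the document survives all rules."""
    words = doc.split()
    if len(words) < min_words:
        return False                      # too short to be useful
    # Excessive repetition: share of duplicated 3-word sequences.
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    if trigrams:
        dup_ratio = 1 - len(set(trigrams)) / len(trigrams)
        if dup_ratio > max_repetition:
            return False
    # Abnormal word-length distribution, e.g. minified code or spam.
    mean_len = sum(len(w) for w in words) / len(words)
    if mean_len < 2 or mean_len > 12:
        return False
    return True
```

Each rule is cheap enough to run over billions of pages, which is the point: the heuristic stage exists to shrink the corpus before the more expensive classifier stage sees it.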
The second stage is classifier-based filtering, and this is where most of the curation leverage lives. The approach is to train a classifier to distinguish high-quality text from low-quality text. You take a set of documents you consider high quality, such as Wikipedia articles, published books, and scientific papers, and a set of low-quality web pages, train a binary classifier on this data, and then score every document in your corpus. Documents that score above a chosen threshold are included; the rest are discarded.
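To make the mechanism concrete, here is a toy version of such a quality scorer. Real pipelines train a proper classifier (a linear model or small neural network over text features); this pure-Python sketch uses smoothed log-odds of word frequencies between the two reference sets, which captures the same idea of scoring every document against a learned notion of quality. All documents and names here are made up for illustration.

```python
import math
from collections import Counter

def train_quality_scorer(high_quality_docs, low_quality_docs):
    """Return a scoring function: positive scores lean 'high quality'."""
    hi = Counter(w for d in high_quality_docs for w in d.lower().split())
    lo = Counter(w for d in low_quality_docs for w in d.lower().split())
    vocab = set(hi) | set(lo)
    hi_total, lo_total = sum(hi.values()), sum(lo.values())
    # Laplace-smoothed log-odds weight per word.
    weights = {
        w: math.log((hi[w] + 1) / (hi_total + len(vocab)))
           - math.log((lo[w] + 1) / (lo_total + len(vocab)))
        for w in vocab
    }
    def score(doc):
        words = doc.lower().split()
        return sum(weights.get(w, 0.0) for w in words) / max(len(words), 1)
    return score

score = train_quality_scorer(
    ["the theorem follows from the lemma", "experiments show the effect"],
    ["click here to win", "buy now limited offer"],
)
# Documents scoring above a chosen threshold are kept, the rest discarded.
keep = score("the lemma implies the theorem") > score("click to buy now")
```

Even in this toy, the curation decision is visible in the code: whoever picks `high_quality_docs` has implicitly defined what "quality" means for every document the scorer will ever judge.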
This is where human expertise enters the pipeline. Someone has to decide what counts as "high quality." The reference set you construct to train the classifier embeds an implicit definition of quality, and that definition shapes the resulting model. If your reference set is mostly formal English prose, the model will be weaker at understanding casual conversation. If it's heavy on scientific text, the model might be great at reasoning, but its writing style will be harder for people to understand.
After filtering, you then face the balancing problem: how much of each type of content should the model see? How much code versus natural language? What about scientific text versus casual conversation? Or English versus other languages? These ratios directly affect the model's strengths.
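One common way to implement these ratios is per-domain sampling weights: oversample domains that are scarce relative to the target mix, downsample abundant ones. The token counts and target ratios below are invented for illustration; real mixtures are closely guarded choices.

```python
# Hypothetical token counts per source and target mixture ratios.
available_tokens = {"web": 500e9, "code": 80e9, "papers": 40e9, "books": 30e9}
target_mix      = {"web": 0.60, "code": 0.20, "papers": 0.12, "books": 0.08}

def sampling_weights(available, target):
    """Per-document sampling rates that reshape the raw corpus into the
    target mixture. A weight > 1 means the domain is repeated
    (oversampled); < 1 means only a fraction of it is used."""
    raw_total = sum(available.values())
    return {d: (target[d] * raw_total) / available[d] for d in available}

w = sampling_weights(available_tokens, target_mix)
# Here code, papers, and books end up oversampled; web is downsampled.
```

A weight above 1 means repeating documents from that domain across epochs, which is itself a curation trade-off: too much repetition of a small domain can cause memorization rather than generalization.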
A strong example of the importance of pre-training curation is Microsoft's Phi series of models. Microsoft's team published a paper titled "Textbooks Are All You Need", in which they trained Phi-1, a model with only 1.3 billion parameters, on a carefully selected set of "textbook quality" data: 6 billion tokens filtered from the web for educational value, plus 1 billion tokens of synthetically generated textbooks and exercises. Despite being trained on far less data than competing models, Phi-1 achieved over 50% accuracy on the coding benchmark HumanEval, outperforming models many times its size. Microsoft expanded this approach with subsequent models, showing that high-quality data can significantly improve capabilities well beyond coding: the 2.7-billion-parameter Phi-2, trained on similarly curated data, outperformed models up to 25 times larger on reasoning benchmarks.
For a long time, the conventional wisdom in LLM development was that scale was the primary driver of capability, and there is a lot of truth in that. Microsoft’s Phi results showed that a team good at selecting and filtering training data can extract more capability from less compute and data. Two developers with the same compute budget and different data curation pipelines will produce models of very different quality.
Between pre-training and RL, there are other training stages with their own curation challenges. Supervised fine-tuning requires carefully crafted demonstrations of how an assistant should behave. Preference training (like RLHF) requires pairs of model responses ranked by quality. These stages matter and will be covered in future posts. For now, we move on to RL environments, where curation is most complex, least understood, and increasingly important for frontier model capabilities.
RL environment curation
In pre-training, curation means filtering text content. In RL, curation means building environments from scratch in which models practice tasks and receive feedback on their success. The quality of these environments directly determines the quality of the resulting model, but the curation challenges are fundamentally different. We describe what an RL environment is in another post [LINK TO “The Role of Human Data in Training AI Models” post]. Briefly, it's a coding setup that includes a virtual machine, a codebase, a task, tests for grading the model's solution, and other components. The model reads code, makes changes, runs commands, and receives a signal indicating whether the solution worked. For RL training to be effective, you need thousands of these environments, each with different tasks at different levels of difficulty. Building them creates two difficult problems: where do you get realistic tasks, and how do you calibrate difficulty so that models actually learn?
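The ingredients listed above can be sketched as a data structure. The field names and the binary pass/fail reward are illustrative assumptions, not any lab's actual schema:

```python
from dataclasses import dataclass

@dataclass
class CodingEnvironment:
    """Minimal sketch of what one RL coding environment bundles."""
    repo_snapshot: str     # path or image with the frozen codebase
    task_prompt: str       # what the model is asked to accomplish
    setup_commands: list   # e.g. install dependencies, seed a database
    test_command: str      # command whose result grades the attempt

    def grade(self, run_tests) -> float:
        # Reward is 1.0 if the grading tests pass, else 0.0.
        return 1.0 if run_tests(self.test_command) else 0.0

env = CodingEnvironment(
    repo_snapshot="repos/json-parser@a1b2c3",
    task_prompt="Fix the crash when parsing empty arrays",
    setup_commands=["pip install -e ."],
    test_command="pytest tests/test_arrays.py",
)
reward = env.grade(lambda cmd: True)  # stub runner; a real one executes cmd
```

Multiply this structure by thousands of environments, each needing a working snapshot, reliable tests, and a well-posed task, and the scale of the curation effort becomes apparent.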
Task sourcing
The most well-known approach to sourcing RL tasks for software engineering is SWE-bench, which mines GitHub repositories for real issues from real codebases. The method involves finding an issue together with the change that fixed it, and using the repository's existing test suite to verify that the model completed the task. This gives you real tasks from real software, a notable advantage over hand-crafted problems.
This approach has constraints. It only works for repositories that have comprehensive tests, because without reliable tests, there is no way to verify whether the model's fix actually works. The tasks you extract are biased toward certain kinds of work, primarily bug fixes in well-maintained open source projects. A significant amount of software engineering falls outside that scope: for example, building new features, refactoring the codebase, or debugging performance issues.
DeepSeek's V3.2 model, released in late 2025, offers a detailed look at a more sophisticated approach. Their technical report describes several separate task synthesis pipelines, each for a different type of agent capability. For coding agents, they mined millions of pairs of issues and their solutions from GitHub across multiple programming languages, building tens of thousands of executable environments for issue resolution.
Three things about this pipeline are worth noting.
1) It requires substantial effort to build. This isn't scraping data from the internet. It's constructing entire simulated worlds with tasks, tools, and verification systems.
2) The pipeline is domain-specific. Different types of tasks require a different approach to sourcing and validation. There is no generic method that works for all of them.
3) The scale is large: there are over 1,800 distinct environments and even more task prompts, giving the RL training process enough diversity to produce skills that generalize beyond the training tasks.
Difficulty calibration
Creating realistic tasks isn't enough. You also need the right distribution of difficulty. If all your tasks are easy, the model solves them quickly and stops learning. If they're too hard, the model never solves any tasks and also won’t learn. Useful training happens in the zone where the model sometimes succeeds and sometimes fails, just as humans learn.
DeepSeek's V3.2 pipeline tackles this through a design principle they describe as "hard to solve, easy to verify." Verification of model solutions needs to be cheap and reliable, because RL training requires running the model against tasks thousands of times. The tasks themselves, though, must be genuinely challenging. Their filtering approach: run a model against each candidate environment, measure how often it succeeds over 100 attempts, and keep only environments the model solves at least once.
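A sketch of that filtering step. The lower bound (solved at least once) follows the reported approach; the upper bound here is an added illustrative cutoff for tasks the model already solves nearly every time, in line with the "productive difficulty zone" idea above:

```python
def in_productive_range(successes, attempts=100, low=1, high=90):
    """Keep environments where the model sometimes succeeds and
    sometimes fails, measured over a fixed number of attempts."""
    return low <= successes <= high

# Hypothetical pass counts out of 100 attempts per candidate environment.
pass_counts = {"env_a": 0, "env_b": 37, "env_c": 100, "env_d": 3}
kept = [env for env, n in pass_counts.items() if in_productive_range(n)]
# env_a is unsolvable for the current model; env_c is already saturated.
```

One subtlety: this filter is relative to the model used for measurement, so as the model improves during training, environments that were once in the productive range drift toward saturation.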
The result is a curated set of environments that sit in the productive difficulty range. Their results suggest the calibration worked: even GPT-5, a frontier model, achieved only 62% accuracy on the dataset.
Conclusion
Dataset curation is the discipline of turning raw data into something a model can learn from. The details vary across domains, but the core principle is the same: the decisions about what data to include, how to filter it, and how to structure it determine the ceiling on what the resulting model can do.
In computer vision, this meant that errors in the ImageNet dataset silently limited progress for years. In LLM pre-training, it means that filtering and balancing decisions across trillions of tokens shape every capability of the resulting model. In RL environments, it means that task sourcing and difficulty calibration determine whether a model learns generalizable skills or overfits to narrow patterns.
For anyone trying to understand which AI companies will produce the best models, data curation is one of the crucial elements of the training pipeline, and one of the hardest to evaluate from the outside.