What Happens During Model Pre-Training


Pre-training is the first step in training an LLM: it turns a model that outputs random words into one that learns to imitate human text. Frontier models require huge pre-training runs: for example, pre-training DeepSeek R1 consumed approximately 11 trillion words and required a cluster of 2,000 GPUs working for two months.

The way this works is conceptually simple. The model sees a sequence of words and tries to predict what comes next. Take the sentence "The Eiffel Tower is located in ". The model assigns a probability to every possible word in its vocabulary. Early in training, it might rank "banana" as the most likely next word, but the correct answer is "Paris," so the training algorithm adjusts the model's internal parameters to make "Paris" more likely the next time it sees a similar context. Multiply this by trillions of examples, and the model gradually gets better at predicting the next word in any text.

The power of pre-training is that models learn not just that "Paris" fits this exact sentence: eventually they learn that the Eiffel Tower actually is in a city named Paris. Along the way, models also pick up more abstract skills, like mathematical reasoning. For example, a model presented with a flawed theorem proof can only correctly complete the phrase "The proof fails because the " if it understands why the proof is flawed. The same goes for human psychology: only a model that understands people can plausibly finish the sentence "She always cares about him because she is a ". The training objective is simple, but the capability it demands is not.
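The adjustment step described above can be sketched in miniature. The following toy example uses a hypothetical four-word vocabulary and plain gradient descent on the output scores, nothing like a real model, but it shows how cross-entropy training shifts probability toward "Paris":

```python
import math

# Toy sketch of pre-training steps on one example. The model's "output"
# is a score (logit) per vocabulary entry; softmax turns scores into
# probabilities, and each step nudges scores toward the correct word.
# The 4-word vocabulary and initial scores are made up for illustration.

vocab = ["Paris", "banana", "the", "tower"]
logits = [0.1, 2.0, 0.5, 0.3]        # early in training: "banana" scores highest
target = vocab.index("Paris")        # the correct continuation

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def step(logits, target, lr=1.0):
    """One gradient step on the cross-entropy loss w.r.t. the logits."""
    probs = softmax(logits)
    loss = -math.log(probs[target])  # high when the model is surprised
    # Gradient of cross-entropy w.r.t. logits is (probs - one_hot(target)).
    grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return [l - lr * g for l, g in zip(logits, grads)], loss

for _ in range(5):
    logits, loss = step(logits, target)

probs = softmax(logits)
print(vocab[probs.index(max(probs))])  # after a few steps: Paris
```

A real training run does the same thing simultaneously for every position in every training sequence, with the gradient flowing back through billions of parameters rather than directly into the logits.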

Most AI developers, even open-source ones, don't share much about their pre-training process: they release models and benchmark scores but disclose little about how training was actually done. The Olmo 3 model is an exception. Its developer, the Allen Institute for AI (Ai2), released everything: training code, data, mid-training model snapshots, and the evaluation suite built to monitor progress during training. This post uses Olmo 3, along with some details from the DeepSeek V3 technical report, to walk through the mechanics of pre-training: how the training objective works, how it's evolving, and how teams figure out whether a training run is actually producing the capabilities they want.

Tokens, not words

I described pre-training as predicting the next word, but that's a simplification. Models don't operate on words; they operate on tokens, and they predict the next token. Put simply, a token is a word or a chunk of a word, and different models slice words into chunks differently. Common short words, like "the" or "will," are usually a single token, while longer and rarer words may be split into several tokens. For example, Claude Sonnet 4.6 splits the word "simplicity" into the tokens "simpli" and "city", while GPT-5.4 treats it as the single token "simplicity". A model's token vocabulary typically contains 50,000 to 150,000 distinct tokens.
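Tokenizers differ in how they build their vocabularies, but the splitting step can be illustrated with a greedy longest-match sketch. The vocabulary below is made up for illustration; real tokenizers learn theirs from data, for example via byte-pair encoding:

```python
# Toy greedy longest-match tokenizer. The vocabulary is hypothetical;
# real models learn vocabularies from data and differ from each other.

VOCAB = {"the", "will", "simplicity", "simpli", "city", "sim", "pli"}

def tokenize(word, vocab):
    """Split a word into the longest matching vocabulary entries, left to right."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest possible match first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # unknown character: 1-char fallback token
            i += 1
    return tokens

print(tokenize("simplicity", VOCAB))                    # ['simplicity']
print(tokenize("simplicity", VOCAB - {"simplicity"}))   # ['simpli', 'city']
```

The two calls mimic the difference between the models in the text: a vocabulary that contains "simplicity" emits one token, while a vocabulary without it falls back to two chunks.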

Vocabulary size is a tradeoff. If tokens are short, mostly 3-5 characters long, the vocabulary can stay small; if tokens are long, the vocabulary must grow, because you need many more long tokens to cover all possible texts. Intuitively: there are about 17 thousand possible 3-letter combinations of the English alphabet, but about 12 million possible 5-letter combinations. And since each token in the vocabulary needs its own parameters in the model, a large vocabulary of long tokens would consume a disproportionate share of a small model's parameters, making small models disproportionately expensive to train and run.

At the same time, a larger vocabulary means you can encode the same sentence in fewer tokens, and since the amount of computation an LLM performs is fixed per token, both training and inference become faster and cheaper.

This matters for interpreting pre-training numbers. When DeepSeek says they trained on 14.8 trillion tokens, that's not 14.8 trillion words; it's roughly 11 trillion words. When people compare the dataset sizes of different models, the comparison is only meaningful if you account for each model's tokenization.
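As a back-of-the-envelope illustration (the words-per-token ratios below are assumptions for two hypothetical models, not published figures), the same reported token count can correspond to different amounts of text:

```python
# Two hypothetical models each report "15 trillion tokens" of training
# data, but their tokenizers pack text differently, so the amount of
# underlying text differs. Ratios are illustrative assumptions.
corpora = {
    "model_a": {"tokens": 15e12, "words_per_token": 0.75},
    "model_b": {"tokens": 15e12, "words_per_token": 0.65},
}

for name, c in corpora.items():
    words = c["tokens"] * c["words_per_token"]
    print(f"{name}: {words:.2e} words")
```

The DeepSeek figure in the text implies a ratio of roughly 0.74 words per token (11 trillion words over 14.8 trillion tokens).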

Beyond next-token prediction

For years, the training objective for LLMs has been the same: predict the next token. Every GPT, from GPT-1 in 2018 to (presumably) GPT-5, was trained this way. The architectures changed, the scale changed, but the core objective stayed fixed.

DeepSeek R1 is one of the first frontier models to modify this. Instead of predicting only the next token, DeepSeek V3, the base model behind DeepSeek R1, was trained to predict both the next token and the one after it at every position.

This has two benefits. First, it makes the training signal denser: a model that must predict two tokens at each position extracts more learning from the same data, which matters when high-quality training data is finite and expensive.

Second, it forces the model to plan ahead. To predict not just the next token but also the one after that, the model has to encode information about where the text is going, not just what comes immediately next. DeepSeek compared models trained with and without multi-token prediction and found consistent improvements with their approach.

After training, the extra prediction mechanism is discarded; the main model works fine on its own.
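A simplified sketch of the combined objective: the loss at each position is the usual next-token cross-entropy plus a weighted cross-entropy for the token after it. (DeepSeek V3's actual mechanism uses an extra transformer module for the second prediction; the weight `lam` here is a made-up hyperparameter.)

```python
import math

# Sketch of a multi-token prediction loss at a single position, in the
# spirit of DeepSeek V3's objective but heavily simplified. The
# probabilities and the weight `lam` are made up for illustration.

def cross_entropy(probs, target):
    """Negative log-probability of the correct token."""
    return -math.log(probs[target])

def mtp_loss(next_probs, next_target, second_probs, second_target, lam=0.3):
    """Main next-token loss plus a weighted loss for the token after it."""
    main = cross_entropy(next_probs, next_target)
    extra = cross_entropy(second_probs, second_target)
    return main + lam * extra

# Probabilities over a tiny 3-token vocabulary for the next token and
# for the token after that; targets are indices into that vocabulary.
loss = mtp_loss([0.7, 0.2, 0.1], 0, [0.1, 0.6, 0.3], 1)
print(round(loss, 3))
```

The second term is what makes the training signal denser: the model is graded on two predictions per position instead of one, from the same text.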

It's worth noting that most other AI developers have not adopted this approach, at least not publicly. Olmo 3, which was trained after DeepSeek V3, uses standard next-token prediction. Multi-token prediction is promising but not yet the default.

The loss curve is not enough

During pre-training, the most direct measure of progress is the loss: a single number that captures how wrong the model's predictions are, on average, over the text it is trained on. The lower the loss, the better the predictions. As training proceeds, the loss goes down, meaning the model is getting better at imitating human writing.
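Concretely, the loss is the average negative log-probability the model assigns to each correct next token. A minimal sketch, with made-up per-token probabilities:

```python
import math

# The pre-training loss is the average negative log-probability the
# model assigned to each actual next token in the training text.
# These per-token probabilities are invented for illustration.

probs_for_correct_token = [0.05, 0.30, 0.60, 0.02, 0.45]

loss = sum(-math.log(p) for p in probs_for_correct_token) / len(
    probs_for_correct_token
)
print(round(loss, 3))  # one number averaging over every token
```

Note how the two very confident predictions (0.60 and 0.45) and the two badly missed ones (0.05 and 0.02) are folded into a single average, which is exactly why the loss alone can hide capability-specific regressions.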

But the loss is a single number that averages over everything. It doesn't tell you whether the model is getting better at math, or at code, or at following instructions. A model's loss might decrease steadily while its reasoning ability stalls and its coding ability improves. Or, if the data used late in training differs significantly from the data used early on, capabilities acquired early can degrade without the aggregate loss revealing it.

This creates a practical problem. A pre-training run for a frontier model costs millions of dollars and runs for weeks or months. The people running it need to know whether things are going well, not just on average, but for the specific capabilities they care about. If a particular type of training data is hurting more than it helps, you want to find out about it as soon as possible, not after the run is finished.

The obvious solution is to run benchmarks during training. Test the model on math problems, coding tasks, and reading comprehension. But this is harder than it sounds.

How Olmo 3 monitors training

The first problem is knowing which benchmarks to use. Not all benchmarks are useful at all stages of training. Some tasks are pure noise early on: a half-trained model scores nothing on hard math problems regardless of how its training is going. Other tasks saturate early: the model hits near-perfect accuracy on easy benchmarks well before training ends, so they stop being informative. You need benchmarks that give a meaningful signal at the model size and the training stage you're currently at.

The Olmo 3 team built a dedicated evaluation suite called OlmoBaseEval with 43 benchmarks. To figure out which benchmarks are useful at which stage, they ran a scaling analysis: they collected over 23,000 benchmark scores from 70 different open-weight models at various scales and sorted the benchmark questions into groups that are useful earlier and later in training. Some tasks give a useful signal early in training, while other tasks only become meaningful for larger models later in training.

For early-stage evaluation, they use a trick. Instead of checking whether the model gets the answer right or wrong, which is noisy when the model is still half-trained, they measure something more fine-grained: how surprised the model is by the correct answer. A model that assigns 40% probability to the right answer is doing better than one that assigns 5% to it, even if neither would pick it as their top choice. This yields a signal that can detect small improvements during training, rather than a binary pass/fail signal that stays at "fail" until the model suddenly starts getting the answer right.
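A sketch of the idea, with made-up probabilities for two half-trained models on one multiple-choice question. Neither model answers correctly, yet the surprisal metric cleanly separates them:

```python
import math

# Comparing two half-trained models by surprisal (-log probability of
# the correct answer) rather than accuracy. All probabilities are
# invented for illustration; neither model's top choice is correct,
# so pass/fail accuracy cannot distinguish them, but surprisal can.

correct = "C"
model_a = {"A": 0.50, "B": 0.30, "C": 0.05, "D": 0.15}
model_b = {"A": 0.45, "B": 0.10, "C": 0.40, "D": 0.05}

for name, probs in [("model_a", model_a), ("model_b", model_b)]:
    top = max(probs, key=probs.get)
    surprisal = -math.log(probs[correct])
    print(f"{name}: top={top} pass={top == correct} "
          f"surprisal={surprisal:.2f}")
```

Both models fail the pass/fail check, but model_b's lower surprisal shows it is measurably closer to the right answer, which is the kind of gradual improvement the Olmo team tracks mid-training.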

The second problem is that evaluation itself is expensive. To get accurate benchmark results mid-training, you normally need to pause and do extra work: a process called learning rate annealing, where the learning rate is gradually decreased over a short stretch of training to stabilize the model before testing it. This takes time and compute.

The Olmo 3 team found a cheaper alternative. Instead of annealing, they take four recent snapshots of the model separated by an hour of training, and average their parameters together. This averaging has a smoothing effect that produces results close to what you'd get from a proper anneal, but costs almost nothing.
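The averaging itself is just an element-wise mean over checkpoints. A sketch with parameters represented as flat lists of floats (in practice these are GPU tensors, and the snapshot values below are invented):

```python
# Sketch of the snapshot-averaging trick: element-wise mean of several
# recent checkpoints' parameters. Values are made up for illustration.

def average_snapshots(snapshots):
    """Element-wise mean across checkpoints of the same model."""
    n = len(snapshots)
    return [sum(params) / n for params in zip(*snapshots)]

# Four checkpoints taken an hour of training apart, each slightly noisy
# around the underlying trend.
snapshots = [
    [0.98, -1.52, 0.31],
    [1.02, -1.49, 0.29],
    [1.01, -1.51, 0.33],
    [0.99, -1.48, 0.27],
]

smoothed = average_snapshots(snapshots)
print([round(p, 3) for p in smoothed])  # [1.0, -1.5, 0.3]
```

The mean cancels out step-to-step noise in each parameter, which is why the averaged model evaluates close to a properly annealed one at almost no extra cost.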

Together, these techniques let the team track specific capabilities throughout training without burning excessive compute on evaluation. They can see whether math performance is improving, whether code ability is stalling, and whether a change to the data mixture helped or hurt, all while the training run is still going.

Why this matters

Pre-training looks straightforward on the surface. In practice, it's an exercise in figuring out how to measure what you actually want the model to learn.

The training objective gives you a single number that goes down. But you care about dozens of capabilities, and the relationship between the loss and each specific capability is indirect. The teams that do pre-training well are not just the ones with the most GPUs. They are the ones that build the instrumentation to see what's actually happening during training, catch problems early, and adjust course before millions of dollars of compute are wasted.

The training objective also shapes what the model can learn in the first place. Standard next-token prediction has been the default for nearly a decade, and it works remarkably well. DeepSeek's multi-token prediction suggests there's room to squeeze more learning from the same data. As high-quality training data becomes scarcer and more expensive, innovations in the objective itself may matter more.

But even the best training objective can only extract what's in the data. The next posts in this series cover the decisions that determine what the model actually sees during pre-training: how different data sources are mixed together, and how they are filtered and deduplicated before training begins.