Apr 8, 2026

Supervised Fine-Tuning (SFT): How Models Learn from Demonstrations

Previously, we covered pre-training. Now, we move to post-training. A pre-trained model is knowledgeable about the world and has learned to complete texts. However, it doesn't yet know how to be a helpful assistant. If you ask it a question, it might generate five more questions. This is because it is only predicting what comes next, and in a text that contains a question, what comes next is often more questions. Post-training is the stage that turns this raw text predictor into a useful tool.

Post-training has several steps, with supervised fine-tuning (SFT) typically coming first. The idea is simple. You show the model examples of the behavior you want: a user asks something, and a helpful assistant responds. The model learns from these demonstrations the same way it learned during pre-training, by predicting the next token of the desired answer. The scale, however, is much smaller than in pre-training: instead of tens of billions of examples, SFT uses only thousands to millions.
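To make "predicting the next token of the desired answer" concrete, here is a minimal sketch of an SFT loss in plain Python. It assumes the common convention that only the assistant's response tokens contribute to the loss while prompt tokens are masked out; that detail is an assumption, and the function name `sft_loss` and the toy probabilities are hypothetical.

```python
import math

def sft_loss(token_probs, is_response):
    """Average negative log-likelihood over response tokens only.

    token_probs[i] -- probability the model assigned to the correct token i
    is_response[i] -- True if token i belongs to the assistant's answer
    (Masking out prompt tokens is a common convention, assumed here.)
    """
    losses = [-math.log(p) for p, keep in zip(token_probs, is_response) if keep]
    if not losses:
        raise ValueError("no response tokens to train on")
    return sum(losses) / len(losses)

# Toy example: 2 prompt tokens followed by 3 response tokens.
probs = [0.10, 0.20, 0.50, 0.25, 0.80]
mask = [False, False, True, True, True]
loss = sft_loss(probs, mask)
```

The objective is the same next-token prediction used in pre-training; what changes is the data, which now consists of curated prompt and response pairs.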

For a long time, SFT focused on imitating humans. SFT developers hired experts to write good demonstrations. Anthropic, OpenAI, and other frontier labs invested heavily in human annotation pipelines, and for several years, this was the main competitive dimension in post-training.

This has changed. Now, SFT datasets are often fully AI-generated. A teacher model produces the training data. Verifier models then filter it, keeping only the good samples for training. This pushes human supervision upstream to the design of the pipeline rather than the authoring of individual examples.

So, if you are using SFT to train a model, you usually need a teacher that is better at the task than the model you are training. For smaller models, this is not a big problem: big frontier models can provide the training samples. But when SFT is used to train larger and more capable models, finding a good teacher becomes difficult.

To explore this in depth, we will look at the SFT stage of two recent open-source models: Olmo 3 32B and DeepSeek V3.2.

Olmo 3: distilling from what's available

The team behind the Olmo 3 models is more transparent about post-training than almost any other lab, releasing their entire SFT dataset. The dataset contains roughly 2.3 million examples spanning mathematics, science, code, instruction-following, chat, and safety. The prompts were pulled from a wide range of public datasets. Some of the prompts came with good completions, while others didn't. Hence, the Olmo 3 team generated completions for their prompts using the DeepSeek R1 or QwQ-32B models. The developers then applied domain-specific filters that checked for correctness. For example, they used synthetically generated test cases for code and rule-based verifiers for precise instruction-following. Completions that failed the checks were discarded. The passing examples were used for SFT.
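A rule-based check for instruction-following can be as simple as a few deterministic predicates. The sketch below is illustrative only: the constraint types (`max_words`, `must_include`) and the function names are hypothetical and are not taken from the Olmo 3 pipeline.

```python
def check_constraints(completion, max_words=None, must_include=()):
    """Return True if the completion satisfies every stated constraint."""
    if max_words is not None and len(completion.split()) > max_words:
        return False
    lowered = completion.lower()
    return all(keyword.lower() in lowered for keyword in must_include)

def filter_examples(examples):
    """Keep only examples whose completion passes its own constraints."""
    return [ex for ex in examples
            if check_constraints(ex["completion"], **ex["constraints"])]

# Hypothetical examples: each carries the constraints stated in its prompt.
examples = [
    {"prompt": "Describe SFT in at most 8 words and mention 'demonstrations'.",
     "completion": "SFT teaches models by imitating demonstrations.",
     "constraints": {"max_words": 8, "must_include": ["demonstrations"]}},
    {"prompt": "Describe SFT in at most 5 words.",
     "completion": "SFT is a post-training stage that uses demonstrations.",
     "constraints": {"max_words": 5}},
]
kept = filter_examples(examples)  # the second example is too long and is dropped
```

Because the checks are deterministic, they can be run over millions of completions without any human review of individual examples.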

They trained two copies of the model on the dataset: one with a high learning rate, and the other with a low learning rate. Afterward, they averaged the weights of the two resulting models. This is called weight merging. The idea is that a low learning rate and a high learning rate produce strong models in slightly different ways, and averaging their weights gives you a model that combines the strengths of both while minimizing their downsides.
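Weight averaging itself is a short operation. The following is a minimal sketch, assuming both checkpoints share the same parameter names and shapes; the dictionary-of-lists representation, the parameter names, and the `alpha` argument are assumptions for illustration, not the Olmo 3 implementation.

```python
def average_weights(state_a, state_b, alpha=0.5):
    """Linearly interpolate two checkpoints with identical parameter names.

    Each state maps a parameter name to a flat list of floats.
    alpha=0.5 is a plain average; other values are shown only for illustration.
    """
    if state_a.keys() != state_b.keys():
        raise ValueError("checkpoints must have the same parameters")
    merged = {}
    for name in state_a:
        a, b = state_a[name], state_b[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [alpha * x + (1 - alpha) * y for x, y in zip(a, b)]
    return merged

# Toy checkpoints standing in for the high- and low-learning-rate runs.
high_lr = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
low_lr = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}
merged = average_weights(high_lr, low_lr)
```

Averaging only makes sense here because both runs start from the same model and train on the same data, so their weights stay close enough to be interpolated.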

The limitation of this approach is that the student model can't surpass the teacher. This didn’t conflict with what the team was building: they wanted the best fully-open 32B model, and external distillation got them there at a reasonable cost. But if you're trying to push past the current frontier rather than catch up to it, you can't distill from a teacher that doesn't exist yet.

This is the problem the DeepSeek V3.2 developer team solved.

DeepSeek V3.2: building your own teachers

DeepSeek V3.2 is a large model comparable to GPT-5 on most reasoning benchmarks. Its high-compute variant even achieved gold-medal performance at the International Mathematical Olympiad. It is hard to find an external teacher strong enough to distill from at this level. If DeepSeek wants to use distillation for SFT, they have to build the teachers themselves.

This is exactly what the V3.2 post-training pipeline did. The team took the same V3.2 pre-trained base model and trained seven specialist copies of it. Each specialist focused on a specific domain: writing, math, programming, general logical reasoning, general agentic tasks, agentic coding, and agentic search. Each specialist was trained with reinforcement learning (RL) in its respective domain.

The specialists weren't the final product. They existed only to generate training data. Once each specialist reached a strong performance level, the team used it to produce completions for a large set of prompts in its domain. The math specialist generated math solutions. The agentic coding specialist generated agentic coding trajectories. These completions became the SFT training data for the final V3.2 model, which was trained to absorb the capabilities of all seven specialists at once.
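The data flow described above can be sketched as a simple routing step: each domain's prompts go to that domain's specialist, and the outputs are pooled into one dataset. This is a hedged illustration, not DeepSeek's code; the function name, the record fields, and the stand-in "specialists" (plain functions instead of RL-trained models) are all hypothetical.

```python
def build_distilled_dataset(prompts_by_domain, specialists):
    """Have each domain's specialist answer that domain's prompts,
    then pool everything into one SFT dataset for the generalist."""
    dataset = []
    for domain, prompts in prompts_by_domain.items():
        specialist = specialists[domain]  # one teacher per domain
        for prompt in prompts:
            dataset.append({
                "domain": domain,
                "prompt": prompt,
                "completion": specialist(prompt),
            })
    return dataset

# Stand-in "specialists": real ones would be RL-trained copies of the base model.
specialists = {
    "math": lambda p: f"[math specialist solution to: {p}]",
    "agentic_coding": lambda p: f"[coding specialist trajectory for: {p}]",
}
prompts_by_domain = {
    "math": ["Prove that the sum of two even numbers is even."],
    "agentic_coding": ["Fix the failing test in the repository."],
}
sft_data = build_distilled_dataset(prompts_by_domain, specialists)
```

The generalist is then fine-tuned on the pooled `sft_data`, so a single model sees demonstrations from every specialist.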

The DeepSeek team reported that a model trained on this distilled data performs only marginally below the specialists themselves. A subsequent RL stage on the generalist closed the remaining gap. The generalist matched the specialists across all seven domains without needing seven separate models at inference time.

For the agentic domains, the team built a large-scale task-synthesis pipeline encompassing over 1,800 distinct environments and 85,000 prompts. These environments served two purposes. First, the agentic specialists were trained in them during RL, learning by interacting with them and receiving feedback. Second, once the specialists were trained, they generated trajectories in the same environments, and those trajectories became the SFT data for the generalist. So, the same set of environments was used twice: once to make the specialists strong, and once to let them produce training examples.

One finding from the V3.2 paper is worth highlighting. After the SFT stage, the team ran an additional RL pass using only general agent data. The model improved substantially on three agentic benchmarks: Tau2Bench, MCP-Mark, and MCP-Universe. They also tried a different version of the same experiment, where the RL used only agentic code and search data instead. This version didn't improve the model on the three benchmarks at all. Both kinds of data look similar on the surface, since both consist of agentic tasks. But the synthetic general-agent data had a structural effect that the more specific code and search data didn't. The team did not fully explain why, but the takeaway is that the diversity of environments matters. A narrower set of environments, even a high-quality one, did not give the model the same kind of generalization.

What the two approaches share

Olmo 3 and DeepSeek V3.2 look different on the surface. One distills from external open models. The other builds its own specialists from scratch. Despite their differences, the two pipelines share several important properties.

Both rely almost entirely on generated completions rather than human-written ones. Olmo 3 is trained on completions from other models, while V3.2 uses completions from specialist versions of itself. In neither approach does a human sit down and write a strong answer to a prompt. Rather, the human effort goes into designing the pipeline: choosing prompts, configuring verifiers, and deciding which checks are important. The models produce the individual training examples.

Both rely on verifier filtering to ensure quality. In both pipelines, many candidates are generated, the responses that pass the checks are kept, and the rest are discarded. This is the opposite of how SFT used to work. The old approach was "write one good answer per prompt." The new approach is "generate twenty, filter down to the best one or two." Compute is cheaper than human annotation, so generating and filtering at scale is now the default.
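The generate-then-filter loop can be sketched in a few lines. This is a minimal illustration under stated assumptions: the "model" is a noisy stand-in function, the verifier is a simple pass/fail exact-match check with no ranking among passing candidates, and the names `generate_and_filter`, `noisy_adder`, and `exact_match` are hypothetical.

```python
import random

def generate_and_filter(prompt, generate, verify, n=20, keep=2):
    """Sample n candidates for a prompt and keep up to `keep` that pass verify."""
    candidates = [generate(prompt) for _ in range(n)]
    passing = [c for c in candidates if verify(prompt, c)]
    return passing[:keep]

# Toy task with a checkable answer; the "model" is a noisy stand-in.
rng = random.Random(0)

def noisy_adder(prompt):
    a, b = (int(x) for x in prompt.split("+"))
    return str(a + b + rng.choice([-1, 0, 0, 1]))  # correct about half the time

def exact_match(prompt, completion):
    a, b = (int(x) for x in prompt.split("+"))
    return completion == str(a + b)

kept = generate_and_filter("17+25", noisy_adder, exact_match)
```

Every completion that survives is correct by construction, even though the generator itself is unreliable. That is the property that lets compute substitute for human annotation.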

Lastly, both produce SFT datasets that are much larger than what humans could plausibly write. Olmo 3 used 2.3 million examples, and DeepSeek's SFT set is larger and spans seven domains. At these scales, human annotation is not realistic.

Where the two approaches diverge

Their shared mechanics are straightforward. The question is where the two pipelines make different choices, and why.

The first difference is who the teacher is. Olmo 3's teacher is an external model that was trained by someone else. V3.2's teacher is a specialist that DeepSeek trained itself. This difference affects what the final model can become. An externally distilled model inherits a ceiling from its teacher. If R1, the teacher model, doesn't know how to solve a problem well, neither will Olmo 3, because it imitates R1. A model distilled from its own specialists has no such external ceiling. The specialists are trained with RL, and RL can go past what any existing model can do.

The second difference is cost. Olmo 3's approach is cheaper than DeepSeek's. You run open models over a set of prompts, filter, and train. V3.2's approach is expensive. Training seven specialists with large-scale RL is resource-heavy before any SFT data gets generated. DeepSeek doesn't disclose the exact numbers, but they spent over 10% of pre-training compute on it.

The third difference is what each approach lets you scale. Olmo 3's ceiling rises only when better open teachers are released. V3.2's ceiling rises when the team builds better environments and better reward signals for specialist RL. This is something DeepSeek controls directly. They can invest in environment design, improve their verifiers, and expand their specialist set, and the generalist improves as a direct result.

The choice between the two approaches isn't really about the SFT technique. It's about what kind of lab you are and what you're trying to build.

What this means for everyone else

The two approaches occupy different positions in the AI ecosystem. If you're building a model that is trying to catch up to the frontier, or match it at a smaller size, external distillation is the right answer. Spending compute on building your own specialists would be wasteful when you can use someone else's.

If you're building a model that's trying to push the frontier forward, external distillation doesn't work. There's no model above you to distill from. You have to generate your own training data somehow, and specialist RL is the current best way to do that. The cost is high, but it's the only path that doesn't cap your final model at the quality of the teachers available to you.

One thing worth noting is that both approaches are built on top of RL infrastructure, directly or indirectly. V3.2's specialists are trained with RL. The teachers that Olmo 3 distills from, R1 and QwQ, were themselves trained with RL. Even when SFT doesn't use RL directly, the quality of the SFT data is downstream of how well someone, somewhere, trained a model with RL. We'll cover this in the next post.

Conclusion

SFT used to be the stage where humans taught models how to behave. It isn't anymore. Olmo 3 and DeepSeek V3.2 both build their SFT datasets from model-generated completions filtered by verifiers, and the human contribution has moved up to the design of the pipeline. The two developers want to build different models, and this dictates different approaches: Olmo 3 aims to catch up to the frontier at a smaller size, so external distillation works, while V3.2 aims to push the frontier forward, so it has to build its own teachers.

One broader pattern is worth naming. Across recent frontier models, SFT is becoming a smaller part of the overall post-training effort. Most of the capability lift now comes from RL, which makes SFT increasingly a bootstrap stage that moves models to where RL can function properly. This shifts attention away from SFT data quality and toward RL training, which is the subject of future posts.
