Apr 8, 2026

Preference Training: How Models Learn What Humans Want

In the previous posts, we covered pre-training and supervised fine-tuning (SFT). SFT teaches a model to behave like an assistant by showing examples of good AI responses. But SFT has a concrete limitation: a human is required to write good responses. For straightforward questions, this works fine. For more complex and subjective ones, such as explaining a detailed topic or responding to an emotionally charged message, there is no single correct answer. Writing one "correct" demonstration means making an arbitrary choice among many reasonable options.

Reinforcement learning from human feedback (RLHF) solves this differently. Instead of requiring someone to write the right answer, you let the model generate several answers and have a human pick the best one. Comparing two responses is much easier than writing a perfect response from scratch, and the approach scales to questions that have several good answers.

The technique works in three steps. First, the model that underwent pre-training and SFT generates multiple responses to the same prompt. Humans then compare these responses and rank them. Next, a separate system called a reward model is trained based on these rankings. The reward model learns to predict which responses humans would prefer, assigning a numerical score to responses. Last, the original model is trained with reinforcement learning to maximize the reward model's scores.

A reasonable question is why this process requires a separate reward model. If humans rank the responses, why not train the model directly on those rankings? The problem is that reinforcement learning (RL) training requires running the model against thousands of prompts and scoring each output. You can't have a human rater grade every model response and feed that decision straight back into training. Doing so would significantly delay model release and cost a fortune. Additionally, you won't be able to reuse those human ratings for future models, while a separate reward model can be reused.

The reward model is a way to distill human preference into something cheap and fast. You collect tens of thousands of human judgments, train a reward model to imitate those judgments, and then use that model as an automated judge for the millions of evaluations that RL training requires.
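As a minimal sketch of what "train a reward model to imitate those judgments" can mean in practice, the snippet below computes a pairwise loss for a single human comparison. It assumes a Bradley-Terry style objective, which is a common choice but is not specified in this post; the function name and the scalar scores are illustrative.

```python
import math


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Loss for one human comparison, assuming a Bradley-Terry style objective.

    Equals -log(sigmoid(score_chosen - score_rejected)): small when the reward
    model scores the human-preferred response higher, large when it does not.
    """
    margin = score_chosen - score_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical reward-model scores for two responses to the same prompt.
agrees_with_human = pairwise_preference_loss(score_chosen=2.0, score_rejected=0.0)
disagrees_with_human = pairwise_preference_loss(score_chosen=0.0, score_rejected=2.0)
print(agrees_with_human < disagrees_with_human)  # True
```

Minimizing this loss over many comparisons pushes the reward model to score preferred responses above rejected ones, which is what lets it stand in for a human rater later.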

This technique was the key ingredient that turned language models into usable assistants. OpenAI's InstructGPT paper in 2022 emphasized that a 1.3 billion parameter model trained with RLHF was preferred by human evaluators over the 175 billion parameter GPT-3. A model that was over a hundred times smaller won on usefulness because it learned what humans actually wanted, not just how to predict the next word. ChatGPT, released later that year, used the same approach. The models before it were knowledgeable but erratic: sometimes producing thoughtful responses and other times producing nonsense. RLHF was what made the difference between a text predictor and an assistant.

Why RLHF breaks down at scale

Despite the progress RLHF has made, it has problems that get worse as models improve.

The first is cost. Human preference data is expensive. Each comparison requires a trained rater to read a prompt, carefully review several model responses, and decide which is better. For domains that require specialized knowledge, such as medicine or law, raters need genuine expertise, which costs more. According to estimates, a single high-quality preference comparison costs several dollars, and you need tens of thousands of them to train one reward model. Additionally, when you train a more capable model, you will often need to train a new reward model.

The second problem is reward hacking. The model is trained to maximize the reward model's score, but the reward model itself is an imperfect approximation of human preference. When you optimize hard against an imperfect proxy, the model finds ways to score well without being better. Consider sycophancy: models learn that humans tend to prefer responses that agree with them, so RLHF-trained models become agreeable, even when the user is wrong. The reward model gives high scores to confident, polished, validating responses, and the model learns to produce those regardless of their accuracy.

A general pattern is that the model learns the surface features of preferred responses instead of the underlying quality. Longer responses tend to receive higher ratings from annotators, so the model learns to be verbose. Responses with confident phrasing score well; thus, the model learns to sound certain even when it’s not. The reward model captures these correlations from the training data, and RL amplifies them.

The last challenge is that humans become worse judges as the models improve. A human rater can reliably compare two short conversational responses. Comparing two detailed technical explanations, two long code solutions, or two multi-step reasoning chains is a different task. As model outputs get longer and more sophisticated, the gap between what the model produces and what a human rater can evaluate in a reasonable time grows. The reward model is only as good as the human judgments it was trained on.

Two parallel shifts

These problems motivated two changes to how preference training works. Happening in parallel and addressing different parts of the pipeline, they transformed the field together.

The first change concerns how preference data is utilized.

In 2023, a group at Stanford University introduced Direct Preference Optimization (DPO), which collapses RLHF into a single step. The core insight is that you don't actually need a separate reward model. You treat the language model itself as an implicit reward model and optimize it directly on the preference pairs.

The difference is what happens during the training phase. Classic RLHF generates new model outputs at every training step, and each one needs a score, which is why a fast, automated judge is needed. DPO sidesteps this entirely. It never generates new outputs during training. Instead, it works only with the preference pairs you already collected. For each pair, the model is trained to increase the probability of the preferred response and decrease the probability of the rejected one. Because DPO doesn’t evaluate anything new, a reward model is not required.
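The per-pair objective can be sketched in a few lines. This follows the loss from the DPO paper, which also compares the model being trained against a frozen reference copy (typically the SFT model), a detail not covered above. The function names, the example log-probabilities, and the default `beta` are illustrative assumptions.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one preference pair, given summed token log-probabilities.

    The loss falls as the policy raises the chosen response's probability and
    lowers the rejected one's, both measured relative to the reference model.
    """
    chosen_shift = policy_logp_chosen - ref_logp_chosen
    rejected_shift = policy_logp_rejected - ref_logp_rejected
    # -log(sigmoid(z)) is the same as softplus(-z).
    return softplus(-beta * (chosen_shift - rejected_shift))


# Hypothetical numbers: the policy starts identical to the reference model,
# then shifts probability toward the chosen response.
before = dpo_loss(-12.0, -10.0, -12.0, -10.0)
after = dpo_loss(-11.0, -11.0, -12.0, -10.0)
print(after < before)  # True
```

No sampling and no reward model appear anywhere in the function: the only inputs are log-probabilities of responses that already exist in the dataset.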

The practical benefits are significant. DPO is cheaper to run because it does not require a separate reward model, and it is far simpler to implement. The training loop looks like standard supervised fine-tuning with a different loss function.

DPO became a default tool in post-training within a year of its publication. Many models released in 2025 use it as part of their preference-training pipelines. But DPO only changes the optimization method. It still requires preference pairs, and they still need to come from somewhere. The more consequential change is who provides the preferences.

Constitutional AI: replacing human raters with principles

The most influential answer came from Anthropic in late 2022, in the form of a technique called Constitutional AI. The idea was to replace human raters with a written set of principles, a "constitution," and have the AI evaluate its own outputs. The principles in Claude's constitution, for example, focus on being safe, ethical, and helpful.

The process has two phases. In the first, the model generates a response to a prompt. It then critiques its own response against a randomly selected principle from the constitution. To illustrate, a principle may be "choose the response that is most helpful and least harmful." Finally, it revises the response based on its own critique. This revised response becomes training data for supervised fine-tuning.
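The first phase can be sketched as a small loop. This is an illustrative outline only: `generate` is a hypothetical stand-in for a call to the language model, and the prompt wording is invented for the example rather than taken from Anthropic's implementation.

```python
import random
from typing import Callable, Dict, Optional, Sequence


def critique_and_revise(
    prompt: str,
    principles: Sequence[str],
    generate: Callable[[str], str],  # hypothetical model call: text in, text out
    rng: Optional[random.Random] = None,
) -> Dict[str, str]:
    """Draft, self-critique against one random principle, then revise."""
    if rng is None:
        rng = random.Random()
    draft = generate(prompt)
    principle = rng.choice(list(principles))
    critique = generate(
        "Critique the response using this principle.\n"
        f"Principle: {principle}\nPrompt: {prompt}\nResponse: {draft}"
    )
    revision = generate(
        "Revise the response to address the critique.\n"
        f"Prompt: {prompt}\nResponse: {draft}\nCritique: {critique}"
    )
    # The (prompt, revision) pair is what would be used as an SFT example.
    return {"prompt": prompt, "response": revision, "principle": principle}


def toy_model(text: str) -> str:
    """Canned replies so the sketch runs without a real model."""
    if text.startswith("Critique"):
        return "The draft is dismissive."
    if text.startswith("Revise"):
        return "A more careful answer."
    return "A first draft."


example = critique_and_revise(
    "How do I handle a difficult coworker?",
    ["Choose the response that is most helpful and least harmful."],
    toy_model,
)
print(example["response"])  # A more careful answer.
```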

In the second phase, the model generates pairs of responses, and another AI model then evaluates which response better satisfies a constitutional principle. These AI-generated preferences replace the human-generated preferences that classical RLHF requires, although Anthropic still uses human preference data for some prompt categories. Next, you train a reward model on this AI preference data and run RL against it, exactly as in RLHF. The difference is that, for the AI-labeled data, no human ever rated an individual output. There, the only human contribution is creating the constitution: a document that encodes what "good" means in natural language.
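A rough sketch of how an AI judge can turn a pair of responses into a preference record is shown below. It is a simplification under stated assumptions: `judge` is a hypothetical model call that answers "A" or "B", and the prompt format is invented for illustration.

```python
from typing import Callable, Dict, Optional


def label_pair(
    prompt: str,
    response_a: str,
    response_b: str,
    principle: str,
    judge: Callable[[str], str],  # hypothetical judge model: returns "A" or "B"
) -> Optional[Dict[str, str]]:
    """Ask an AI judge which response better satisfies a principle."""
    question = (
        f"Principle: {principle}\nPrompt: {prompt}\n"
        f"(A) {response_a}\n(B) {response_b}\n"
        "Which response better follows the principle? Answer A or B."
    )
    verdict = judge(question).strip().upper()
    if verdict.startswith("A"):
        chosen, rejected = response_a, response_b
    elif verdict.startswith("B"):
        chosen, rejected = response_b, response_a
    else:
        return None  # Unparseable verdict: drop the pair rather than guess.
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


# A canned judge so the sketch runs without a real model.
record = label_pair(
    "Is it safe to mix bleach and ammonia?",
    "Sure, go ahead.",
    "No. The mixture releases toxic gas.",
    "Choose the response that is most helpful and least harmful.",
    judge=lambda question: "B",
)
print(record["chosen"])  # No. The mixture releases toxic gas.
```

The resulting records have the same shape as human preference pairs, which is why the rest of the pipeline can stay unchanged.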

Anthropic named the RL phase of this approach reinforcement learning from AI feedback (RLAIF). The cost difference is dramatic. A piece of AI preference data costs less than a cent, much less than human preference data. This makes it feasible to generate preference data at a scale that would be prohibitive with human raters.

But Constitutional AI did not just reduce cost. It also took humans out of one of the most unpleasant parts of the work. Humans who evaluate harmful content are exposed to that content as part of their job. Safety training requires showing raters toxic, violent, or otherwise disturbing outputs so they can flag them. Constitutional AI largely removes humans from this loop: they only need to write the principles that AI then uses to evaluate harmful content.

The tradeoff is that AI feedback has different failure modes than human feedback. AI judges can be confidently wrong in ways that human raters wouldn't be. For example, they can miss subtle issues that a human would catch, or inherit the biases of the model doing the judging. Constitutional AI doesn't eliminate the need for human judgment; it moves it upstream.

What preference training looks like in 2025

Every major lab has likely diverged from classical RLHF. Three open models illustrate the range of what preference training currently looks like.

Olmo 3, the most transparent model in the series we've been following, uses DPO with a twist. Instead of collecting preference pairs from human raters or AI judges, its developers construct the pairs from the outputs of two models with very different capabilities. The preferred response comes from a medium-sized model named Qwen 3 32B. The rejected response comes from Qwen 3 0.6B, a weaker model. The same prompt goes to both models, and the stronger model's output is always preferred.

The team hypothesized that the model learns from the delta between the two responses, not from either response's absolute quality. Their results supported this: the response pairs from Qwen 3 32B and Qwen 3 0.6B produced stronger gains than expected. This strategy eliminates preference collection. No human raters, AI judges, or constitutional principles were involved. The "preference" is the capability gap between two existing models.
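Constructing such pairs takes very little code. The sketch below assumes two hypothetical callables, `strong_model` and `weak_model`, standing in for the larger and smaller models; it illustrates the idea and is not the Olmo team's actual data pipeline.

```python
from typing import Callable, Dict, Iterable, List


def build_delta_pairs(
    prompts: Iterable[str],
    strong_model: Callable[[str], str],  # hypothetical: the more capable model
    weak_model: Callable[[str], str],    # hypothetical: the less capable model
) -> List[Dict[str, str]]:
    """Label the strong model's output 'chosen' and the weak model's 'rejected'."""
    return [
        {"prompt": p, "chosen": strong_model(p), "rejected": weak_model(p)}
        for p in prompts
    ]


# Canned stand-ins so the sketch runs without real models.
pairs = build_delta_pairs(
    ["Explain recursion."],
    strong_model=lambda p: "A function that calls itself on a smaller input.",
    weak_model=lambda p: "Recursion is recursion.",
)
print(pairs[0]["chosen"])  # A function that calls itself on a smaller input.
```

No judge appears in the function: the label comes entirely from which model produced the text.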

DeepSeek V3.2 takes a different approach. Their RL stage is conceptually similar to the algorithm from classical RLHF, but it splits tasks into two categories. For tasks with verifiable answers, such as math problems or code, the reward is simple and objective: did the code pass the required tests, or is the proof correct? Hence, no reward model is needed.
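For code tasks, a verifiable reward can be as simple as running the candidate against test cases. The sketch below is a generic illustration, not DeepSeek's implementation; the function name and the all-or-nothing scoring are assumptions, and a production system would run untrusted code in a sandbox.

```python
from typing import Any, Callable, Sequence, Tuple

TestCase = Tuple[Tuple[Any, ...], Any]  # (positional arguments, expected result)


def verifiable_reward(candidate: Callable[..., Any], tests: Sequence[TestCase]) -> float:
    """Return 1.0 if the candidate passes every test case, else 0.0."""
    for args, expected in tests:
        try:
            if candidate(*args) != expected:
                return 0.0
        except Exception:
            return 0.0  # A crash counts as a failure.
    return 1.0


tests = [((2, 3), 5), ((-1, 1), 0)]
print(verifiable_reward(lambda a, b: a + b, tests))  # 1.0
print(verifiable_reward(lambda a, b: a * b, tests))  # 0.0
```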

For tasks without verifiable answers, like open-ended writing or explaining a concept, they use a generative reward model. Suppose the prompt is "explain why the sky is blue to a ten-year-old." A classical reward model would assign a generic quality score based on patterns it had learned from thousands of human preference comparisons across all kinds of prompts. Instead, the generative reward model produces a rubric specific to the prompt. The rubric might ask whether the explanation is scientifically accurate, whether the language is appropriate for a child, and whether it avoids complex terms. Then, the model scores the response against these criteria rather than producing a single generic quality number. Because the rubrics are tailored to each prompt, the feedback is more precise than what a one-size-fits-all reward model can provide.
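The rubric idea can be sketched as two model calls: one writes criteria for the prompt, the other checks the response against each criterion. Both callables below are hypothetical stand-ins, and averaging pass/fail results into one number is an assumption made for illustration; how the real system aggregates rubric judgments is not described above.

```python
from typing import Callable, List, Tuple


def rubric_score(
    prompt: str,
    response: str,
    write_rubric: Callable[[str], List[str]],      # hypothetical: prompt -> criteria
    meets_criterion: Callable[[str, str], bool],   # hypothetical: (criterion, response) -> pass/fail
) -> Tuple[float, List[Tuple[str, bool]]]:
    """Score a response as the fraction of prompt-specific criteria it meets."""
    criteria = write_rubric(prompt)
    if not criteria:
        raise ValueError("rubric must contain at least one criterion")
    results = [(c, bool(meets_criterion(c, response))) for c in criteria]
    score = sum(passed for _, passed in results) / len(results)
    return score, results


# Canned stand-ins so the sketch runs without a real model.
def toy_rubric(prompt: str) -> List[str]:
    return ["scientifically accurate", "child-friendly language", "avoids complex terms"]


def toy_check(criterion: str, response: str) -> bool:
    # Pretend the response fails only the jargon check.
    return criterion != "avoids complex terms"


score, details = rubric_score(
    "Explain why the sky is blue to a ten-year-old.",
    "Sunlight undergoes Rayleigh scattering in the atmosphere.",
    toy_rubric,
    toy_check,
)
print(details[2])  # ('avoids complex terms', False)
```

Keeping the per-criterion results alongside the score shows which part of the rubric a response failed, which a single generic quality number cannot do.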

DeepSeek also merges reasoning, agent, and alignment training into a single RL stage rather than running them sequentially. They found that mixing different reward signals produces a model that generalizes better. Running the RL stages separately caused catastrophic forgetting: for example, when a model was later trained on coding, its alignment degraded.

The Kimi K2 team took the idea of AI-generated feedback furthest. Their framework utilizes both verifiable rewards and what they call a self-critique rubric reward, and the interaction between the two is the interesting part.

For tasks with clear right and wrong answers, Kimi K2 uses verifiable rewards similar to DeepSeek. The model solved math problems or wrote code and received a binary signal based on whether the result was correct.

For subjective tasks, the model acted as its own judge. It generated several responses to a prompt, then evaluated them in pairs against a set of rubrics. The rubrics come in three types: core, prescriptive, and human-annotated. Core rubrics encode the values developers want the model to have: it must be clear, helpful, and honest. Prescriptive rubrics exist specifically to prevent reward hacking: they penalize shortcuts that models typically learn during RL, such as verbosity or sycophancy. Human-annotated rubrics are written by the data team for specific types of prompts where generic rubrics are not sufficient.

Using the model as its own judge creates a problem. A model that grades its own homework has every reason to give itself good marks. At the end of the day, if it did not consider these responses to be good, it would not generate them.

The Kimi team's solution was to use the verifiable tasks as a reality check for the subjective judge. During training, the model solved thousands of math and coding problems with objective answers. These results were used to continuously retrain the model handling subjective judgments. If the critic started drifting, such as giving high scores to dense but shallow explanations, the verifiable tasks would catch it. A critic that rewarded verbosity over substance would also misjudge verifiable tasks where the verbose answer is wrong and the concise answer is right. The objective signal corrected the drift.

Think of it as a teacher who grades both math tests and essays. You can check whether her math grading is correct because math has objective answers. If her math grading stays sharp, you have more reason to trust her essay grading as well. If her math grading starts slipping, that is a signal to recalibrate before the essay grades become unreliable.

This is a meaningful insight. The problem with pure AI feedback has always been that the judge can drift from accuracy. It can develop blind spots, “reward hack” itself, or gradually lower its own standards. Kimi's design utilizes verifiable rewards as an anchor. You don't need every task to be verifiable. You just need enough verifiable tasks to keep the subjective judge honest.
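One way to picture the anchor is as an accuracy check: give the critic pairs where the right answer is known and see how often it prefers the correct one. The sketch below only measures drift and flags it; in the setup described above, the verifiable results are fed back to retrain the critic. The function names, the pair format, and the 0.9 threshold are all illustrative assumptions.

```python
from typing import Callable, Iterable, Tuple

# (prompt, correct_response, incorrect_response), with correctness known objectively.
VerifiablePair = Tuple[str, str, str]


def critic_accuracy(
    pairs: Iterable[VerifiablePair],
    critic_prefers_first: Callable[[str, str, str], bool],  # hypothetical critic call
) -> float:
    """Fraction of verifiable pairs where the critic prefers the correct response."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one verifiable pair")
    hits = sum(
        1 for prompt, right, wrong in pairs if critic_prefers_first(prompt, right, wrong)
    )
    return hits / len(pairs)


def needs_recalibration(accuracy: float, threshold: float = 0.9) -> bool:
    """Flag the critic when it falls below a (hypothetical) accuracy threshold."""
    return accuracy < threshold


def verbose_critic(prompt: str, first: str, second: str) -> bool:
    """A drifted critic that simply prefers the longer response."""
    return len(first) > len(second)


checks = [
    ("What is 2 + 2?", "4", "After careful thought, the answer is 5."),
    ("What is 3 * 3?", "9", "Multiplying three by three gives 6."),
]
accuracy = critic_accuracy(checks, verbose_critic)
print(accuracy, needs_recalibration(accuracy))  # 0.0 True
```

The verbosity-loving critic scores zero here because, in both pairs, the concise answer is the correct one.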

Humans haven't left the building

Despite these advances, to the best of our knowledge, frontier AI developers still utilize humans for preference training.

Llama 4, released by Meta in 2025, still uses human preference data in its DPO stage, but Meta explicitly describes this as a "lightweight" stage. Anthropic used human preference data for helpfulness alongside Constitutional AI for harmlessness, and is likely to continue using human data. Kimi K2's self-critique system was bootstrapped from a mix of open-source and in-house preference datasets, some of which included human-made labels. The closed loop eventually runs on its own, but it needed human judgment to get started.

A research direction called reinforcement learning from targeted human feedback (RLTHF) makes the pattern explicit. Instead of having humans review everything, you use AI to label the easy cases and route only the ambiguous ones to humans. The researchers found that targeting just 6-7% of the data for human annotation achieved the same alignment quality as labeling the entire dataset by hand.
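The routing step can be sketched as a sort-and-split. This is a simplified illustration, not the RLTHF authors' selection procedure: it assumes each example comes with some AI confidence score (a hypothetical `ai_confidence` callable) and sends the least confident examples to humans, up to a fixed budget.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def route_for_annotation(
    examples: Sequence[T],
    ai_confidence: Callable[[T], float],  # hypothetical: higher means the AI label is more certain
    human_budget: int,
) -> Tuple[List[T], List[T]]:
    """Split examples into (send to humans, keep AI label), least confident first."""
    if human_budget < 0:
        raise ValueError("human_budget must be non-negative")
    ranked = sorted(examples, key=ai_confidence)
    return ranked[:human_budget], ranked[human_budget:]


# Hypothetical confidence scores for five AI-labeled comparisons.
confidence = {"a": 0.99, "b": 0.52, "c": 0.91, "d": 0.60, "e": 0.97}
to_humans, to_ai = route_for_annotation(list(confidence), confidence.get, human_budget=2)
print(to_humans)  # ['b', 'd']
```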

The pattern across all of these approaches is the same as what we identified in an earlier post about the role of human data in AI training. Human contribution is not disappearing. Instead, it is moving upstream. In 2022, humans rated individual model outputs. In 2025, humans write constitutions, design rubrics, construct verification systems that keep AI judges calibrated, and only intervene in the hardest cases that AI judges can't resolve. The per-unit labor has decreased dramatically, but the judgment required for each remaining unit has increased.

Conclusion

Preference training has gone through a full cycle in three years. Classical RLHF proved that human preferences can turn a text predictor into a useful assistant. DPO showed that the reward model and RL algorithm are unnecessary if you have preference pairs. Constitutional AI demonstrated that the human raters themselves can be replaced by a set of written principles and an AI judge. More recent systems, such as Kimi K2's closed-loop critic, showed that even the AI judge can be kept honest if it is anchored to tasks with verifiable answers.

Two things happened simultaneously. First, the optimization method got simpler: we moved from the three-step RLHF pipeline to DPO's single training pass. Second, the feedback source shifted from human raters to AI judges guided by constitutions, rubrics, and capability gaps between models. These are independent developments that arrived at the same time, and modern post-training pipelines combine the two.

The result is that preference training is no longer in the expensive, fragile stage it was in 2022. It still depends, though, on the quality of the signal that drives it. For verifiable tasks, that signal comes from test suites and automated checks. For subjective tasks, it comes from rubrics and AI judges that are still designed and calibrated by humans. The verification infrastructure underneath preference training has become as important as the preference data itself.