GRPO, DAPO, and RLVR didn’t just improve on RLHF — they replaced it. Here’s why the old recipe broke, and what’s actually shipping now.
A year ago, the post-training recipe was settled: pretrain on trillions of tokens, run supervised fine-tuning (SFT) on curated instruction data, then use reinforcement learning from human feedback (RLHF) to align the model. Every lab had some variant of this pipeline. It was expensive, slow, and opaque — but it worked, and everyone used it.
That recipe is now dead.
Every major reasoning model released in the past year — DeepSeek-R1, Nemotron 3 Super, Qwen3 — uses a fundamentally different post-training stack. The methods changed, the reward sources changed, and the results changed with them. And yet most explanations of “how LLMs are trained” still describe RLHF as if it’s current. This post closes that gap.
TL;DR
- RLHF’s core problem wasn’t the feedback — it was the bottleneck of human annotation and the computational expense of PPO’s four-model setup.
- GRPO eliminates the critic and reward model, making RL feasible on a single GPU. It samples multiple completions per prompt and uses group statistics to compute advantages.
- RLVR replaces human judgment with programmatic verifiers (unit tests, math checkers) for reasoning tasks — faster, cheaper, and more consistent.
- DAPO fixes the instabilities that appear when you scale GRPO to long chain-of-thought outputs.
- The new stack is modular: SFT for instruction following, DPO/SimPO for alignment, GRPO+RLVR for reasoning. These solve different problems and stack in a specific order.
Why RLHF Broke Down
RLHF, as implemented in early ChatGPT and Claude models, had a seductive structure: gather human preference data, train a reward model on it, and use PPO to optimize the language model against that reward signal. In practice, it ran into three compounding problems.
The annotation bottleneck. Human preference labels are the fuel for RLHF. They’re also slow, expensive, inconsistent across annotators, and impossible to scale to the volume needed for frontier model training. You can’t label your way to a model that can solve AIME problems — humans can’t reliably rank mathematical reasoning at that level.
The four-model memory problem. PPO requires four models to be live simultaneously: the policy model you’re training, a frozen reference copy of that model (to compute KL divergence), a reward model (trained on human preference data), and a critic/value model (to estimate future rewards). For a 70B parameter model, this is a severe infrastructure constraint. The memory alone was forcing labs to distribute across dozens of GPUs just for the RL phase.
Reward model drift. The reward model is a separate neural network trained on human comparisons. It has its own failure modes: it can be gamed by the policy (reward hacking), it encodes annotator biases, and it becomes stale as the policy improves beyond the distribution on which the reward model was trained. The policy eventually learns to produce outputs the reward model rates highly, rather than outputs that are actually good.
The solution that emerged wasn’t to patch RLHF — it was to eliminate the components responsible for all three problems.

GRPO: Killing the Critic
Group Relative Policy Optimization (GRPO), introduced by the DeepSeek team in their DeepSeekMath paper, makes a single architectural change that cascades into massive practical benefits: it eliminates the critic model.
In PPO, the critic estimates the value of each state: how much expected future reward the model can collect from this point in the sequence onward. GRPO replaces this with a simpler idea: sample a group of completions for the same prompt, score all of them, and use the group's statistics as the baseline.
The advantage of any completion is:
A_i = (r_i - mean(r_1...r_G)) / std(r_1...r_G)
where r_i is the reward for completion i, and G is the group size (typically 4–8 completions). Completions that score better than average get positive advantages; those that score worse get negative advantages. No critic needed — the group itself provides the baseline.
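As a quick sanity check, the group baseline is easy to compute by hand. Here is a minimal sketch in NumPy (not part of any library API, just the formula above):

import numpy as np

def group_advantages(rewards: list[float], eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages for one prompt's G completions (the formula above)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)  # eps guards the all-same-reward case

# Eight completions for one prompt, scored by a binary verifier
print(group_advantages([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]))
# The three correct completions get positive advantages; the five wrong ones get negative.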
Here’s a minimal GRPO training loop using Hugging Face TRL:
from trl import GRPOConfig, GRPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

def reward_fn(completions: list[str], ground_truth: list[str], **kwargs) -> list[float]:
    """
    Verifiable reward: 1.0 if the answer is correct, 0.0 otherwise.
    This replaces the human-labeled reward model entirely.
    TRL passes extra dataset columns (here, ground_truth) as lists aligned with completions.
    """
    rewards = []
    for completion, truth in zip(completions, ground_truth):
        # Extract the answer from the completion (assumes <answer>...</answer> tags)
        answer = extract_answer(completion)    # helper not shown
        correct = check_answer(answer, truth)  # helper not shown
        rewards.append(1.0 if correct else 0.0)
    return rewards

config = GRPOConfig(
    num_generations=8,          # G - completions per prompt
    learning_rate=1e-6,
    max_completion_length=512,
    beta=0.0,                   # No KL penalty (DAPO finding: hurts reasoning tasks)
    output_dir="./grpo_output",
)

trainer = GRPOTrainer(
    model=model,
    reward_funcs=reward_fn,
    args=config,
    train_dataset=math_dataset,  # prompts plus a ground_truth column
)
trainer.train()
What this buys you: no separate critic model means you’ve cut your memory footprint roughly in half compared to PPO. No separate reward model means you don’t need the expensive human annotation pipeline. And because GRPO makes RL feasible on a single GPU for small models, it opened the door to wide experimentation — which is why the community moved fast.
RLVR: Replacing Human Judges With Verifiers
The second major shift is about where the reward signal comes from. RLHF gets a reward from human judgment. RLVR — Reinforcement Learning with Verifiable Rewards — gets reward from programmatic verifiers.
The insight is narrow but powerful: for math, code, and structured reasoning tasks, you don’t need a human to judge quality. A unit test tells you unambiguously if the code runs correctly. A math checker tells you if the final answer matches. A proof verifier tells you if the logic is valid. These signals are binary, instant, consistent, and infinitely scalable.
DeepSeek-R1-Zero demonstrated the starkest version of this: a model trained purely with RLVR (no SFT warmup, no human preference data) spontaneously developed chain-of-thought reasoning, self-reflection, and dynamic strategy adaptation — emergent behaviours that weren’t explicitly trained for. The model learned to “think out loud” because longer, more careful reasoning consistently produced correct answers, which the verifier rewarded.
def math_verifier(completion: str, ground_truth: str) -> float:
    """
    Example verifier for math problems.
    Extracts a numerical answer and checks exact match (or tolerance).
    """
    import re
    # Look for the final boxed answer in LaTeX format
    match = re.search(r'\\boxed\{([^}]+)\}', completion)
    if not match:
        return 0.0
    predicted = match.group(1).strip()
    try:
        # Numerical comparison with tolerance
        # (eval on model output is fine for a toy example; use a real math parser in production)
        pred_val = float(eval(predicted))
        true_val = float(eval(ground_truth))
        return 1.0 if abs(pred_val - true_val) < 1e-6 else 0.0
    except Exception:
        # String match fallback
        return 1.0 if predicted == ground_truth.strip() else 0.0
def code_verifier(completion: str, test_cases: list[dict]) -> float:
    """
    Example verifier for code generation.
    Runs the extracted code against unit tests.
    (Illustration only: run untrusted model code in a sandbox in production.)
    """
    import subprocess, tempfile, os
    code = extract_code_block(completion)  # helper not shown
    if not code or not test_cases:
        return 0.0
    passed = 0
    for test in test_cases:
        full_code = code + "\n" + test["test"]
        with tempfile.NamedTemporaryFile(mode='w', suffix='.py', delete=False) as f:
            f.write(full_code)
            fname = f.name
        try:
            result = subprocess.run(
                ["python", fname], timeout=5, capture_output=True
            )
            if result.returncode == 0:
                passed += 1
        except subprocess.TimeoutExpired:
            pass
        finally:
            os.unlink(fname)
    return passed / len(test_cases)
The limitation is equally clear: RLVR only works where verification is tractable. Math and code have clear verifiers. “Write a good product description” does not. For open-ended alignment tasks — helpfulness, harmlessness, nuanced human preferences — human feedback or an AI proxy for it remains necessary. RLHF isn’t dead everywhere; it’s dead for the reasoning layer.
DAPO: Fixing What GRPO Gets Wrong at Scale
GRPO works well on small models and short outputs. When you scale to large models and long chain-of-thought sequences, three specific failure modes appear. DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization, ByteDance/Tsinghua, 2025) addresses all three.
Problem 1: Length bias in the loss. Standard GRPO normalizes the loss by dividing by sequence length — effectively averaging the loss per token. This sounds neutral but isn’t: short responses get disproportionately large gradient updates, while long responses (which contain the actual reasoning) have their gradients diluted. The model is implicitly incentivized to be brief even when correctness requires length.
DAPO’s fix: token-level policy gradient loss. Instead of normalizing per sequence, compute loss across all tokens in the batch equally. Every token contributes equally regardless of which response it came from.
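To make the difference concrete, here is a minimal sketch of the two aggregations in PyTorch. The names tok_loss (per-token loss, shape [batch, seq_len]) and mask (1.0 for real tokens, 0.0 for padding) are hypothetical; this shows only the normalization step, not a full training loop.

import torch

def sequence_level_loss(tok_loss: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # GRPO-style: average within each sequence first, then across sequences.
    # A 10-token answer and a 1,000-token reasoning chain get equal total weight.
    per_sequence = (tok_loss * mask).sum(dim=1) / mask.sum(dim=1)
    return per_sequence.mean()

def token_level_loss(tok_loss: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # DAPO-style: average over every token in the batch equally,
    # so long responses contribute in proportion to their length.
    return (tok_loss * mask).sum() / mask.sum()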
Problem 2: Entropy collapse. As training progresses, the model’s output distribution can collapse — it stops exploring and always generates the same response patterns. This is particularly damaging for reasoning tasks where diverse exploration is how the model discovers better strategies.
DAPO’s fix: Clip-Higher. Standard PPO/GRPO uses a symmetric clip range [1-ε, 1+ε] on the policy ratio. DAPO decouples the two bounds: the lower bound stays tight, so the policy can’t over-correct away from bad responses, while the upper bound is raised, so low-probability tokens that appear in good responses can be reinforced without immediately hitting the clip. This keeps entropy from collapsing while maintaining stability.
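In code, Clip-Higher is a small change to the standard clipped surrogate. A minimal sketch, using the DAPO paper's reported bounds (0.2 lower, 0.28 upper) as defaults:

import torch

def clipped_surrogate_loss(ratio, advantages, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped objective with decoupled bounds (Clip-Higher).
    ratio is pi_new / pi_old per token; symmetric PPO would use eps_low == eps_high."""
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
    # Maximize the pessimistic (min) estimate; negate to get a loss to minimize.
    return -torch.minimum(unclipped, clipped).mean()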
Problem 3: Vanishing gradients on uniform batches. If all 8 completions for a prompt get the same reward (all correct or all wrong), the group normalization produces advantages of zero: no gradient signal. Vanilla GRPO still pays the full generation and scoring cost for these prompts but learns nothing from them.
DAPO’s fix: Dynamic Sampling. Keep sampling new prompts and responses until every batch has non-zero reward variance. Informationally empty batches don’t contribute to training.
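A sketch of the idea in plain Python, where generate_group and score_group are hypothetical stand-ins for your rollout and verifier code:

def fill_batch(prompt_stream, batch_size: int, num_generations: int = 8):
    """Dynamic sampling: only keep prompt groups whose rewards actually vary."""
    batch = []
    for prompt in prompt_stream:
        completions = generate_group(prompt, n=num_generations)  # hypothetical rollout helper
        rewards = score_group(prompt, completions)               # hypothetical verifier helper
        if max(rewards) == min(rewards):
            continue  # all-correct or all-wrong: zero advantages, no gradient signal
        batch.append((prompt, completions, rewards))
        if len(batch) == batch_size:
            break
    return batch

With those three fixes in mind, here is how a baseline GRPO configuration compares to a DAPO-style one in TRL: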
from trl import GRPOConfig

# Standard GRPO config (baseline)
grpo_config = GRPOConfig(
    num_generations=8,
    beta=0.04,                        # KL penalty
    loss_type="grpo",                 # Sequence-level loss
)

# DAPO-style config via TRL
# (parameter names and supported loss_type values vary across TRL versions; check your installed version's docs)
dapo_config = GRPOConfig(
    num_generations=8,
    beta=0.0,                         # No KL penalty — DAPO finding
    loss_type="token_level",          # Token-level loss — fixes length bias
    epsilon=0.2,                      # Lower clip bound
    epsilon_high=0.28,                # Higher upper clip bound — Clip-Higher
    mask_truncated_completions=True,  # Ignore cut-off responses
    # Dynamic sampling (filtering out zero-variance reward groups) may require
    # custom sampling logic; check whether your TRL version supports it natively.
)
On AIME 2024, DAPO trained Qwen2.5-32B to 50 points, outperforming DeepSeek-R1-Zero with 50% fewer training steps. The full system is open-sourced.
The Modular Stack: How It All Fits Together
The clearest mental model for the new post-training pipeline is that it’s modular — each stage solves a specific problem that the previous stage doesn’t address.

1. Pretraining (next-token prediction)
   └── Objective: language understanding, world knowledge, basic reasoning
   └── Scale: trillions of tokens

2. Supervised Fine-Tuning (SFT)
   └── Objective: instruction following — teaching the model response format
   └── Data: curated instruction-response pairs
   └── No RL; just cross-entropy loss on demonstration data

3. Preference Optimization (DPO / SimPO / KTO)
   └── Objective: alignment — making outputs helpful and non-harmful
   └── Data: preference pairs (response A preferred to response B)
   └── Replaces the RLHF reward model + PPO with a simpler contrastive objective
   └── Does NOT require online generation

4. Reasoning RL (GRPO + RLVR, with DAPO fixes)
   └── Objective: push reasoning quality beyond what SFT data can provide
   └── Data: math/code problems with verifiable answers
   └── Online generation: model generates its own training data
   └── Reward: binary signal from programmatic verifiers
The key insight is that steps 3 and 4 solve different problems. DPO aligns the model with human values — it’s about what outputs are acceptable. GRPO+RLVR improves reasoning ability — it’s about making the model smarter within the aligned space. Running them in the wrong order or conflating their objectives is a common mistake.
What This Means for Practitioners
Fine-tuning for reasoning tasks is now tractable on modest hardware. Because GRPO eliminates the critic/value model and RLVR eliminates the learned reward model, you can run RLVR fine-tuning on a 7B model with two consumer GPUs. TRL exposes this through GRPOTrainer. If you have a domain with verifiable answers — medical coding, legal clause classification, structured data extraction — you can now train reasoning capability into a small model without a multi-machine setup.
Your reward function is now your biggest lever. In the old RLHF world, the reward model quality was opaque — it was a neural network trained on thousands of human comparisons. In RLVR, your reward function is code you write. Writing a good verifier is the new hyperparameter tuning. A reward function that’s too easy (model games it), too hard (no positive signal), or poorly specified (rewards surface features instead of correctness) will produce a bad model regardless of everything else.
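One pattern that helps in practice is a composite reward: a strict correctness check plus a small format bonus, so the model gets some signal before it ever produces a fully correct answer. A sketch, reusing the placeholder extract_answer and check_answer helpers from the GRPO example above:

def composite_reward(completion: str, ground_truth: str) -> float:
    """Correctness dominates the reward; a small format bonus keeps early training from flatlining."""
    reward = 0.0
    if "<answer>" in completion and "</answer>" in completion:
        reward += 0.1  # format bonus: the model learned to emit the expected tags
    answer = extract_answer(completion)  # placeholder helper, not shown
    if answer is not None and check_answer(answer, ground_truth):
        reward += 1.0  # correctness is still the dominant term
    return reward

Keep the format bonus small relative to the correctness term; if the two are comparable, the model will optimize formatting instead of answers.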
The strange finding about spurious rewards. One counterintuitive recent result: researchers showed that RLVR on Qwen2.5-Math models with completely random or incorrect rewards still produced benchmark gains. This suggests some of the improvement attributed to “learning from verifiable rewards” may actually be coming from the training dynamics of RL itself — the online generation, the diverse exploration, the gradient signal from group comparison — rather than the specific content of the reward signal. The field is still digesting this. It’s worth being skeptical of strong causal claims about why RLVR works.
Gotchas Nobody Tells You
Reward hacking didn’t go away — it just changed shape. With human reward models, reward hacking meant producing outputs that sounded good but weren’t. With verifiable rewards, it means learning to game the verifier. A math model that learns to output the correct final number without understanding the problem, by pattern-matching surface features of training examples, will pass the verifier while failing on distribution shifts. Design your verification to be robust, not just binary.
Long CoT outputs create training instability that’s easy to miss. If you’re running GRPO without DAPO’s fixes and your completions are frequently hitting the max length limit, you’re training on truncated reasoning chains. The model gets penalized for outputs it didn’t finish. This creates a subtle pressure toward shorter, less thorough responses — exactly the opposite of what you want for reasoning. Always set mask_truncated_completions=True and monitor your truncation rate.
DPO and GRPO interact in non-obvious ways. If you run DPO for alignment before GRPO for reasoning, the DPO training introduces a KL constraint relative to the SFT model. When GRPO then runs, it’s optimizing against a distribution that’s already been shifted by DPO. The reference model you set for GRPO matters — using the SFT model vs. the DPO-aligned model as the GRPO reference produces meaningfully different results. There’s no universal answer here; it’s empirical and task-dependent.
Conclusion
The new post-training stack is not a revolution in the sense of a single breakthrough — it’s an accumulation of targeted fixes to specific failure modes in the old pipeline. GRPO fixed the memory and annotation bottleneck of PPO. RLVR fixed the reward model bottleneck for tasks with programmatic verifiers. DAPO fixed the training instabilities that appeared when you scaled GRPO to long reasoning chains.
What’s genuinely surprising is how much this opened up. The move from four-model PPO to two-model GRPO (policy + optional reference) made reasoning RL accessible to researchers and practitioners who couldn’t run frontier infrastructure. The move from human labels to verifiable rewards removed the most expensive and slowest part of the pipeline. DeepSeek-R1 demonstrated what you get when you combine both — and the field moved fast.
RLHF isn’t gone. For alignment, for nuanced preference learning, for tasks where correctness can’t be verified programmatically, it remains the right tool. What’s gone is RLHF as the only technique in the post-training stack — the assumption that every improvement in model behaviour had to flow through human preference labels and PPO.
The question for 2026 is what verifiers can be built for domains beyond math and code. Medical diagnosis, legal reasoning, scientific claims — these have verifiable criteria, even if they’re harder to program. The teams that figure out how to express domain expertise as reward functions will have access to the same flywheel that produced DeepSeek-R1.