On-Policy Distillation: Promise, Pitfalls, and Prospects

Jun 8 2026

Promise

The promise of on-policy distillation (OPD) (Gu et al., 2023; Agarwal et al., 2023; Lu and Thinking Machines Lab, 2025) comes from how it rearranges two fundamental ingredients, the policy that generates rollouts and the density of the learning signal attached to those rollouts. The first axis is on-policy versus off-policy, asking whether we train on trajectories sampled from the student itself or from an external teacher or dataset. The second axis is sparse versus dense supervision, asking whether the model receives only an outcome-level reward or a token-level signal along the trajectory. Supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) occupy opposite corners of this space. SFT provides dense learning signals, but usually on off-policy trajectories. RLVR uses on-policy rollouts, but often reduces an entire reasoning trace to a sparse verifiable reward. OPD tries to combine the useful half of both, on-policy student rollouts with dense teacher feedback.

Each corner has a clear advantage and a clear limitation. SFT is information-rich because every token in the target trajectory can become a supervised signal, which makes it an effective way to transfer knowledge from the expert. However, this off-policy training recipe can be resource-intensive when scaling performance. The off-policy mismatch between training and deployment also shows up as the familiar compounding errors in imitation learning. During training, the model predicts the next token after clean expert prefixes; during deployment, every token is conditioned on prefixes generated by the model itself. If an early token drifts away from the demonstration distribution, later tokens are sampled from a shifted context, which moves the trajectory even further into regions where the dataset provides little guidance.

Figure: Two-axis diagram comparing SFT, RLVR, and OPD by rollout source and supervision density. SFT gives dense supervision on off-policy expert trajectories. RLVR uses on-policy student rollouts but usually provides only a sparse outcome signal. OPD tries to keep the on-policy part of RLVR while recovering the dense token-level supervision of distillation.

RLVR takes the opposite trade-off from SFT. Its main strength is that the rollouts are on-policy, so the update is grounded in behavior the student actually produces. If the model gets stuck in a bad reasoning pattern, RLVR can train directly on that failure mode instead of only showing another expert demonstration. The limitation is the sparsity of the learning signal. A verifier may tell the student that a rollout succeeded or failed, but this feedback carries only a small amount of information no matter how many tokens the model used. For a long reasoning trace, the model may learn that the final answer was wrong without knowing where the mistake happened, or that the final answer was right without knowing which steps actually mattered. This makes credit assignment weak for both positive and negative samples.
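To make the sparsity concrete, here is a minimal sketch, assuming a binary verifier and a constant baseline. The helper `broadcast_outcome_reward` and the baseline value are illustrative assumptions, not the API of any particular RLVR library. The point is only that one outcome-level number is copied to every token, so a 5-token failure and a 2000-token failure receive the same per-token signal.

```python
from typing import List


def broadcast_outcome_reward(num_tokens: int, is_correct: bool, baseline: float = 0.5) -> List[float]:
    """Copy one outcome-level advantage to every token of a rollout.

    Hypothetical helper: the constant baseline of 0.5 is an assumption
    made for illustration only.
    """
    reward = 1.0 if is_correct else 0.0
    advantage = reward - baseline
    return [advantage] * num_tokens


short_fail = broadcast_outcome_reward(5, is_correct=False)
long_fail = broadcast_outcome_reward(2000, is_correct=False)

# Every position gets the same value, regardless of where the mistake happened.
print(set(short_fail), set(long_fail))
```

Nothing in this signal distinguishes the token where the reasoning went wrong from the tokens that were fine, which is the credit-assignment weakness described above.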

OPD is appealing because it tries to keep the on-policy part of RLVR while recovering the dense supervision of distillation. The student still generates the trajectory, so the training data is aligned with the states the student actually visits. But instead of assigning a single reward to the whole rollout, a stronger teacher can provide token-level feedback along that student-generated trajectory, often through log-probabilities or a KL-style objective. In this sense, the teacher is no longer just a source of static demonstrations. It becomes a critic of the student’s own prefixes, telling the student what the expert would have preferred at the states where the student actually needs guidance. A closely related variant is on-policy self-distillation (OPSD) (Zhao et al., 2026). Here the teacher is usually the same student model, but run with some extra information, such as the ground-truth answer. For readability, I will use OPD as an umbrella term in the rest of this post, with OPSD treated as a variant unless the distinction matters. The source of the teacher signal is different, but many of the motivations, benefits, and failure modes are shared.
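As a rough illustration of the dense signal, the sketch below computes one reverse-KL value per position of a student-generated rollout. It assumes we already have the student's and teacher's full next-token distributions at each student prefix; the function names and the toy 3-token vocabulary are hypothetical, and real implementations typically work with log-probabilities from the models rather than explicit probability lists.

```python
import math
from typing import List, Sequence


def reverse_kl(student_probs: Sequence[float], teacher_probs: Sequence[float]) -> float:
    """KL(student || teacher) for one next-token distribution.

    Assumes teacher probabilities are strictly positive wherever the
    student puts mass.
    """
    return sum(
        p * (math.log(p) - math.log(q))
        for p, q in zip(student_probs, teacher_probs)
        if p > 0.0
    )


def per_token_opd_signal(
    student_dists: Sequence[Sequence[float]],
    teacher_dists: Sequence[Sequence[float]],
) -> List[float]:
    """One reverse-KL value per position of a student-generated rollout."""
    return [reverse_kl(s, t) for s, t in zip(student_dists, teacher_dists)]


# Toy rollout with two positions: the teacher agrees at the first position
# and disagrees at the second.
student_dists = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
teacher_dists = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
signal = per_token_opd_signal(student_dists, teacher_dists)
print(signal)
```

Unlike a single outcome reward, the signal here is zero where the teacher agrees and positive where it disagrees, so it localizes the feedback to specific positions of the student's own trajectory.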

Pitfalls

The story, however, is not entirely positive. Several recent and concurrent works have identified failure modes of OPD. These pitfalls are worth understanding because they make one point clear: on-policy distillation is not a free lunch. Rather than going through each result separately, I will organize the discussion around three connected failure mechanisms: local corruption of teacher supervision, horizon-induced supervision decay, and myopic gradients from the teacher. These are not separate boxes. The second can amplify the first, and the third comes from the current training paradigm of OPD: post-hoc supervision after the student’s rollout.

Local Noise in Teacher Supervision

The first pitfall is a local target mismatch: OPD asks the teacher to provide token-level supervision on student-generated prefixes, but the resulting teacher distribution is not always a reliable training target. When a student-generated prefix is sparse or off-manifold for the teacher, the teacher’s next-token distribution can mix several different moves: steering the continuation back toward the teacher’s usual reasoning manifold, locally continuing the student’s prefix, or taking a plausible but unhelpful step. Misleading prefixes make this worse, and the per-token supervision signal can collapse because locally plausible continuations may reinforce the wrong branch (Fu et al., 2026; Zhu et al., 2026).

Figure: A local target mismatch, where the teacher’s local distribution may mix recovery actions with continuations of the misleading student prefix. The dense signal is still local, but it is no longer a clean recovery target.

A number of recent works, directly or indirectly, identify this local noisy teacher supervision and propose ways to recover more reliable guidance. Their fixes often revolve around two kinds of interventions, sometimes separately and sometimes together: modifying the KL-style objective, and reweighting or selecting token-level signals. Fu et al. use teacher top-K local support matching to restrict the KL comparison to a more reliable local support (Fu et al., 2026). TIP reweights tokens using student entropy and teacher-student divergence, focusing the update on uncertain positions and overconfident mistakes (Xu et al., 2026). SCOPE routes rollouts by correctness, using teacher-perplexity-weighted KL for failed trajectories and student-perplexity-weighted MLE for successful ones (Zheng et al., 2026). Entropy-Aware OPD changes the divergence by teacher entropy, combining reverse KL in low-entropy regions with forward KL in high-entropy regions (Jin et al., 2026). OPSD introduces per-token clipped KL to stabilize training (Zhao et al., 2026), while DASD makes the direction of supervision entropy-adaptive to preserve exploration (Zhang et al., 2026). The common lesson is that dense supervision is not automatically reliable supervision: OPD needs some way to decide which local signals to trust, how strongly to trust them, and in which direction to learn.
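Two of the intervention styles above, restricting the KL comparison to a local support and clipping per-token values, can be sketched as follows. This is a simplified illustration under my own assumptions (renormalizing both distributions over the teacher's top-K tokens, and clipping with a plain `min`); it is not the exact objective of Fu et al. or Zhao et al., and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def topk_support(teacher_probs: Sequence[float], k: int) -> List[int]:
    """Indices of the k tokens the teacher considers most likely."""
    order = sorted(range(len(teacher_probs)), key=lambda i: teacher_probs[i], reverse=True)
    return order[:k]


def renormalize(probs: Sequence[float], support: Sequence[int]) -> List[float]:
    """Restrict a distribution to `support` and rescale it to sum to 1.

    Assumes the distribution puts nonzero mass on the support.
    """
    total = sum(probs[i] for i in support)
    return [probs[i] / total for i in support]


def topk_reverse_kl(student_probs: Sequence[float], teacher_probs: Sequence[float], k: int) -> float:
    """Reverse KL computed only over the teacher's top-k tokens."""
    support = topk_support(teacher_probs, k)
    s = renormalize(student_probs, support)
    t = renormalize(teacher_probs, support)
    return sum(p * (math.log(p) - math.log(q)) for p, q in zip(s, t) if p > 0.0)


def clip_per_token(losses: Sequence[float], max_value: float) -> List[float]:
    """Cap each per-token loss so a single position cannot dominate the update."""
    return [min(loss, max_value) for loss in losses]


student = [0.5, 0.3, 0.2]
teacher = [0.6, 0.39, 0.01]  # the teacher gives the third token almost no mass
full_kl = topk_reverse_kl(student, teacher, k=3)  # k = vocabulary size: ordinary reverse KL
top2_kl = topk_reverse_kl(student, teacher, k=2)
print(full_kl, top2_kl, clip_per_token([0.1, 5.0], max_value=1.0))
```

In this toy case most of the full reverse KL comes from the single low-probability teacher token, and the top-2 version removes that contribution. Whether dropping such tokens is desirable depends on whether the tail reflects noise or a genuine teacher preference, which is exactly the trust decision the paragraph above describes.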

Horizon-induced Teacher Coverage Decay

The second pitfall is the horizon-level version of the local problem above. By training on rollouts generated by the student itself, OPD addresses the student-side compounding error that appears in off-policy distillation methods such as SFT and is also familiar from the imitation-learning literature. However, OPD does not eliminate the distribution shift; it relocates the burden from the student to the teacher. The student is on-policy, but the teacher is asked to supervise prefixes sampled from the student rather than prefixes the teacher would naturally produce. Early in a rollout, this distinction may be small. After several student decisions, however, the prefix can contain a different decomposition, a different intermediate value, or a different plan from the one the teacher would have followed. Even if each deviation is minor, the deviations accumulate in the prefix. As the horizon grows, the teacher is increasingly queried on states outside its usual prefix distribution. Although teacher supervision remains dense, it can become less reliable as the local noise accumulates over longer horizons, especially in tasks that require long responses such as MATH-style reasoning tasks.

Figure: Teacher coverage decay over long OPD rollouts. Early student prefixes can still lie inside the teacher’s usual prefix distribution. Over a long rollout, the student trajectory may gradually drift out of that coverage region, so the feedback remains dense but becomes less reliable.
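A back-of-the-envelope model shows how quickly small deviations can add up. Assume, purely for illustration, that each student token independently pushes the prefix outside the teacher's usual prefix distribution with probability `eps`, and that a prefix never returns once it has left. Neither assumption comes from the cited works; real drift is neither independent nor irreversible. Under this toy model, the probability that a prefix of length `horizon` is still in coverage is `(1 - eps) ** horizon`.

```python
def in_coverage_probability(eps: float, horizon: int) -> float:
    """Probability that a prefix of length `horizon` has never left coverage.

    Toy model: each token independently exits coverage with probability `eps`.
    """
    return (1.0 - eps) ** horizon


for horizon in (10, 100, 1000):
    print(horizon, in_coverage_probability(0.01, horizon))
```

With a 1% per-token exit rate, roughly 90% of prefixes remain in coverage after 10 tokens, about 37% after 100 tokens, and well under 0.1% after 1000 tokens. The numbers are only as good as the toy assumptions, but they show why long reasoning traces are where coverage decay bites.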

Recent fixes mostly try to prevent long rollouts from turning dense supervision into dense but unreliable supervision. StableOPD stabilizes training with a reference-based divergence constraint and rollout mixture distillation (Luo et al., 2026). Prune-OPD makes the reliability test more explicit by monitoring student-teacher compatibility, reducing the influence of unreliable rollout suffixes, and adapting the rollout budget according to the effective length of reliable tokens in recent samples (Yang et al., 2026). Liu et al. (2026) similarly detect when later trajectory segments become hard to teach locally and stop applying dense supervision to those unreliable suffixes. Pushing this logic to its most direct form, Zhou et al. (2026) and Zhang et al. (2026) simply cap the rollout budget at an extreme value (e.g., 100 response tokens), yet still obtain comparable performance.
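The suffix-pruning idea can be sketched with a simple loss mask. The rule below, which cuts the rollout once a per-token divergence exceeds a threshold for several consecutive tokens, is an assumption chosen for illustration; it is not the compatibility test used by Prune-OPD or Liu et al., and the names `reliable_prefix_length` and `suffix_mask` are hypothetical.

```python
from typing import List, Sequence


def reliable_prefix_length(divergences: Sequence[float], threshold: float, patience: int) -> int:
    """Number of leading tokens kept before a sustained run of high divergence.

    The cut is placed at the start of the first run of `patience`
    consecutive tokens whose divergence exceeds `threshold`.
    """
    run = 0
    for t, d in enumerate(divergences):
        run = run + 1 if d > threshold else 0
        if run >= patience:
            return t + 1 - patience
    return len(divergences)


def suffix_mask(divergences: Sequence[float], threshold: float, patience: int) -> List[float]:
    """Loss mask: 1.0 for tokens in the reliable prefix, 0.0 for the pruned suffix."""
    keep = reliable_prefix_length(divergences, threshold, patience)
    return [1.0 if t < keep else 0.0 for t in range(len(divergences))]


# An isolated spike at position 2 is tolerated; the sustained run from position 4 is pruned.
divergences = [0.1, 0.2, 3.0, 0.1, 2.5, 2.8, 3.1, 0.2]
mask = suffix_mask(divergences, threshold=1.0, patience=2)
print(mask)
```

The fixed-budget variant mentioned at the end of the paragraph is the degenerate case of this mask: keep the first N tokens regardless of any divergence statistics.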

Can a Perfect Teacher Recover the Correction Path?

So far, the main OPD failures we have discussed come from the reliability of teacher supervision. These concerns naturally lead to a sharper question: if we remove the reliability problem and give OPD a perfect teacher, can it guide the student along the correction path from a misleading prefix? Sadly, the answer is still probably no under the current OPD paradigm. Current per-token OPD is structurally limited because it is a post-hoc per-token objective evaluated along the student’s rollout. A useful way to explain why is to examine the gradient that practical OPD actually uses. If we start from a sequence-level reverse-KL objective, the policy-gradient form gives each sampled token a return that includes both its own teacher-student log-ratio and the future log-ratios along the same student rollout:

$$ \nabla_\theta J(\theta) = \mathbb{E}_{x\sim \mathcal{D},\, y\sim \pi_\theta(\cdot \mid x)} \left[ \sum_{t=1}^{|y|} \left( \delta_t + \sum_{t'=t+1}^{|y|} \delta_{t'} \right) \nabla_\theta \log \pi_\theta(y_t \mid x, y_{\lt t}) \right], \tag{2} $$

where $\delta_t = \log \pi_\theta(y_t \mid x, y_{\lt t}) - \log \pi_T(y_t \mid x, y_{\lt t})$; see Yang et al., 2026 for the full derivation. In practice, however, OPD is usually implemented as a per-token surrogate:

$$ \nabla_\theta J(\theta) = \mathbb{E}_{x\sim \mathcal{D},\, y\sim \pi_\theta(\cdot \mid x)} \left[ \sum_{t=1}^{|y|} \delta_t \nabla_\theta \log \pi_\theta(y_t \mid x, y_{\lt t}) \right]. \tag{3} $$

This is a biased gradient relative to the sequence-level objective. Under bounded reward and score-function assumptions, the worst-case variance of the sequence-level estimator scales as $O(T^4)$ in the response length $T$, while that of the per-token surrogate scales as $O(T^2)$ (Fu et al., 2026). The surrogate therefore stabilizes long-response training, but it also introduces myopic teacher supervision, where each position’s local teacher mismatch is treated as the whole learning signal rather than part of a trajectory-level return.
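The difference between the two gradients above is only in the scalar that multiplies each token's score function: the sequence-level form uses $\delta_t$ plus all later $\delta_{t'}$, while the per-token surrogate uses $\delta_t$ alone. A minimal sketch of those weights, with made-up $\delta$ values and hypothetical function names:

```python
from typing import List, Sequence


def per_token_weights(deltas: Sequence[float]) -> List[float]:
    """Per-token surrogate: each token is weighted by its own log-ratio only."""
    return list(deltas)


def sequence_level_weights(deltas: Sequence[float]) -> List[float]:
    """Sequence-level form: token t is weighted by delta_t plus all later deltas."""
    weights: List[float] = []
    running = 0.0
    for d in reversed(deltas):
        running += d
        weights.append(running)
    return weights[::-1]


# Made-up log-ratios: the large mismatch happens at the third token.
deltas = [0.25, 0.0, 2.0, 0.5]
print(per_token_weights(deltas))
print(sequence_level_weights(deltas))
```

Under the sequence-level weights, the first two tokens are held responsible for the large mismatch that follows them. Under the per-token surrogate they receive almost no signal, which is the myopia discussed here.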

This myopic gradient becomes limiting when the student reaches a misleading prefix and recovery requires a multi-step correction path. When the teacher sees such a misleading prefix and tries to recover, it may correctly assign high probability to a correction-onset token such as “Wait” or “Actually”. Per-token OPD can make this first correction token more likely, but it still conditions the following loss on the student’s original failed continuation, not on the prefix that would have followed from that correction token. Even if the teacher knows how the full recovery should unfold, the gradient can keep pointing to correction-onset tokens from several misleading contexts without supervising the multi-step correction path itself. This concern was also raised informally by Prof. Omar Khattab on X. The myopic signal is not useless, however; biasing the student toward a correction-onset token can still trigger self-reflection in future rollouts, although unfolding the full corrected path remains bounded by the student’s own capacity. A more detailed analysis of this myopic-supervision mechanism appears in our work (Jiang et al., 2026).

Figure: Myopic per-token OPD. Per-token OPD can strengthen local correction hints, but those hints remain attached to misleading prefixes, and later supervision stays on the misleading student rollout. The corrected continuation needed for multi-step recovery is not generated.

A natural response is to move back from the token-level surrogate to the sequence-level gradient, but this does not resolve the underlying issue, because vanilla OPD is post-hoc supervision. The student first generates the entire rollout, and teacher supervision is then computed along that fixed trajectory. A sequence-level loss may provide a stronger correction signal for earlier decisions, but it still does not expose the student to the recovery path; the later supervision remains conditioned on the original failed continuation. Meanwhile, because the teacher is not perfect in practice, noise in the local supervision signals for later tokens can also propagate back to earlier tokens and accumulate there, further degrading the learning signals for earlier tokens.
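The backward propagation of noise can be seen directly from how the sequence-level weights are formed. In the sketch below, the per-token log-ratios are all zero except for a single perturbation at the last position, standing in for one noisy teacher judgment. The values and the helper `suffix_sums` are illustrative assumptions.

```python
from typing import List, Sequence


def suffix_sums(values: Sequence[float]) -> List[float]:
    """Weight for position t: values[t] plus everything after it."""
    out: List[float] = []
    running = 0.0
    for v in reversed(values):
        running += v
        out.append(running)
    return out[::-1]


clean = [0.0, 0.0, 0.0, 0.0]
noisy = [0.0, 0.0, 0.0, 1.0]  # teacher noise at the last token only

# Which positions see a different weight once the noise is added?
changed_per_token = [a != b for a, b in zip(clean, noisy)]
changed_sequence = [a != b for a, b in zip(suffix_sums(clean), suffix_sums(noisy))]
print(changed_per_token)
print(changed_sequence)
```

With per-token weights, only the noisy position is affected. With sequence-level weights, every earlier position inherits the perturbation, so errors in late, poorly covered parts of the rollout contaminate the signal for early tokens.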

The fixes discussed so far are ineffective against myopic supervision because they do not intervene on the frozen failed student rollout itself. One route is interleaved teacher-student sampling. SKD lets the student propose tokens and asks the teacher to reject or replace low-quality proposals from the student (Xu et al., 2025). Although SKD is not motivated by this problem, it can be interpreted as an intervention on the frozen student rollout that recovers the correction path. However, SKD incurs extra generation-time cost to implement its interleaved sampling strategy. More importantly, this intervention can still fail because the teacher may be queried on the same misleading prefixes that gave rise to the mixture-distribution issue in the first place. In the extreme case, the teacher either provides little useful recovery information or effectively dominates the rollout, turning the resulting trajectory into an off-policy sample with respect to the student.
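The interleaved sampling loop can be sketched as follows. The acceptance rule used here, keeping a student proposal only if it falls in the teacher's top-K tokens and otherwise substituting the teacher's most likely token, is my assumption for how "reject or replace" might be implemented; the callables and the toy distributions are hypothetical stand-ins for real models.

```python
from typing import Callable, Dict, List, Tuple


def interleaved_rollout(
    student_propose: Callable[[List[str]], str],
    teacher_dist: Callable[[List[str]], Dict[str, float]],
    k: int,
    max_len: int,
) -> Tuple[List[str], int]:
    """Build a rollout token by token, letting the teacher veto student proposals.

    Returns the tokens and the number of positions the teacher replaced.
    """
    tokens: List[str] = []
    replaced = 0
    for _ in range(max_len):
        proposal = student_propose(tokens)
        dist = teacher_dist(tokens)
        top_k = sorted(dist, key=dist.get, reverse=True)[:k]
        if proposal in top_k:
            tokens.append(proposal)
        else:
            tokens.append(top_k[0])  # greedy replacement keeps the toy deterministic
            replaced += 1
    return tokens, replaced


def toy_student(prefix: List[str]) -> str:
    return "b"  # the toy student always proposes the same token


def toy_teacher(prefix: List[str]) -> Dict[str, float]:
    if len(prefix) % 2 == 0:
        return {"a": 0.6, "b": 0.3, "c": 0.1}
    return {"a": 0.5, "c": 0.4, "b": 0.1}


tokens, replaced = interleaved_rollout(toy_student, toy_teacher, k=2, max_len=4)
print(tokens, replaced / len(tokens))
```

The replaced fraction is a rough gauge of the trade-off described above: near zero, the teacher adds little recovery information; near one, the trajectory is effectively the teacher's and no longer on-policy for the student.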

Another route is to generate a refined trajectory after the raw full rollout. In Trajectory-Refined Distillation (TRD) (Jiang et al., 2026), the student first samples a raw on-policy rollout $y_o$, which may contain a misleading prefix or a failed reasoning path. The teacher then generates a refined rollout $y_r$ that unfolds the correction path while remaining close to the student’s distribution support by preserving useful parts of $y_o$, including its style when appropriate. Distillation is then applied along this refined trajectory. Beyond improving single-attempt performance, TRD can also broaden reasoning-path coverage by exposing the student to alternative valid derivations. Like SKD, TRD introduces an additional sampling cost, here for the refined trajectory. TRD may still fail when the required recovery exceeds the teacher’s capability, especially for smaller teacher models.

Figure: Trajectory-Refined Distillation refines a raw on-policy rollout into a corrected trajectory before applying distillation.
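The control flow described above (sample a raw rollout, refine it, distill along the refined trajectory) can be written as a short skeleton. This only mirrors the three steps as stated in this post. The callables `student_sample`, `teacher_refine`, and `distill_loss` are hypothetical placeholders, the toy stubs are invented for illustration, and the actual refinement procedure and loss in Jiang et al. (2026) are not reproduced here.

```python
from typing import Callable, List, Tuple

Tokens = List[str]


def trd_step(
    prompt: str,
    student_sample: Callable[[str], Tokens],
    teacher_refine: Callable[[str, Tokens], Tokens],
    distill_loss: Callable[[str, Tokens], float],
) -> Tuple[Tokens, float]:
    """One training step: raw rollout, teacher refinement, then distillation."""
    y_o = student_sample(prompt)           # raw on-policy rollout, may be wrong
    y_r = teacher_refine(prompt, y_o)      # refined rollout that unfolds the correction
    return y_r, distill_loss(prompt, y_r)  # loss is computed along y_r, not y_o


# Toy stubs, invented for illustration only.
def toy_student(prompt: str) -> Tokens:
    return ["2", "+", "2", "=", "5"]


def toy_refine(prompt: str, y_o: Tokens) -> Tokens:
    # Keeps the student's attempt and appends an explicit correction path.
    return y_o + ["Wait", ",", "2", "+", "2", "=", "4"]


def toy_loss(prompt: str, y_r: Tokens) -> float:
    return float(len(y_r))  # placeholder value, not a real distillation loss


y_r, loss = trd_step("What is 2 + 2?", toy_student, toy_refine, toy_loss)
print(y_r, loss)
```

The structural difference from vanilla OPD is the trajectory the loss is evaluated on: the student is supervised on tokens that follow the correction, rather than only on the continuation of its original failed path.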

Prospects

Looking ahead, three possible directions seem especially promising for future OPD research.

Mitigating Myopic Supervision. Naive token reweighting and length truncation leave the myopic supervision problem entirely intact. The teacher should not merely provide post-hoc supervision after student rollouts; instead, it should help shape the final trajectory used for OPD training so that trajectory-level failures are mitigated. Addressing this problem could fundamentally alleviate the local-noise and long-horizon issues discussed above.

Integration of RLVR and OPD. A growing body of recent work explores combining RLVR with OPD to obtain stronger learning signals, especially for verification-failed trajectories (Li et al., 2026). Nevertheless, naively routing failed trajectories to OPD can still inherit OPD’s own failure modes and may fail to match the performance of vanilla RLVR algorithms. A deeper understanding of how OPD and RLVR signals should interact could provide insights for building more reliable OPD algorithms.

Learning Beyond the Teacher. Recent work suggests that, under appropriate objective design, the student may surpass the teacher instead of being bounded by teacher performance (Yang et al., 2026). This raises a broader question: when should OPD rely on the teacher, and when should it downweight or even discard teacher signals? OPSD offers one possible route by removing the explicit teacher policy and using additional information, such as the ground-truth answer, to guide self-distillation. This view is also reminiscent of goal-conditioned RL, where the learning signal is tied less to imitating a fixed teacher and more to reaching a desired outcome.

Overall, I see OPD less as a finished recipe than as a useful interface that will likely become deeply coupled with the broader post-training stack. Future progress may therefore depend not only on improving OPD itself, but also on understanding how it should interact with RLVR and other post-training techniques. Comments, corrections, and pointers to related work are very welcome!

Video introduction of On-policy OPD from Dwarkesh Patel.
On SFT, RL, and on-policy distillation from Will Brown.
SFT, RL, and On-Policy Distillation Through a Distributional Lens from wh..
The Imitation Game: State of Policy Distillation in Language Model training from Chinmay Karkar.
GitHub AwesomeOPD from Wei Liu.

Citation

If you find this post or our paper useful, please cite:

@misc{jiang2026opdreflection,
  author = {Jiang, Li},
  title = {On-Policy Distillation: Promise, Pitfalls, and Prospects},
  year = {2026},
  month = {June},
  url = {https://louieworth.github.io/blog/opd_reflection/},
  note = {Blog post}
}

@misc{jiang2026trajectoryrefineddistillation,
  author = {Jiang, Li and Xu, Haoran and Ding, Yichuan and Zhang, Amy},
  title = {Trajectory-Refined Distillation},
  year = {2026},
  eprint = {2606.08432},
  archivePrefix = {arXiv},
  primaryClass = {cs.AI},
  url = {https://arxiv.org/abs/2606.08432}
}