The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs
Summary
This paper identifies a safety threshold in on-policy distillation with reward extrapolation, beyond which structured-output tasks lose format preservation. Empirical validation shows that operating below this threshold allows a 1.7B student model to match an 8B SFT baseline on Amazon Fashion tasks at roughly one-fifth the parameter count.
Source: https://huggingface.co/papers/2605.08737 · Published on May 9 · Submitted by XinLi (https://huggingface.co/XINLI1997) on May 14
Abstract
On-policy distillation with reward extrapolation exhibits a safety threshold beyond which structured output tasks lose format preservation, with empirical validation showing performance parity at reduced parameter count when operating below this threshold.
On-policy distillation (OPD) is widely used for LLM post-training. When pushed with a reward-extrapolation coefficient lambda > 1, the student can surpass the teacher in-domain, but past a threshold lambda* the same step violates the output contract on structured-output tasks. In a single-position Bernoulli reduction, we derive a closed-form base-relative clip-safety threshold lambda*(p, b, c) determined by three measurable quantities: the teacher modal probability, the warm-start mass, and the importance-sampling clip strength. Above lambda*, the extrapolated fixed point exits the clip-safe region, changing training from format-preserving to format-collapsing. We extend the rule to calibrated K-ary listwise JSON tasks where a single binding equivalence class dominates the output contract and SFT retains parse headroom. On Amazon Fashion, three pre-registered tests (a fine-grid cliff interval, a budget-extension test, and a small-clip cross-prediction) fall within their locked prediction windows, with the small-clip value matching the closed-form prediction below grid resolution. Operating just below lambda*, ListOPD brings a 1.7B Qwen3 student to in-domain parity with an 8B SFT baseline at one-fifth the parameters. The gain is driven primarily by format adherence: NDCG@1 on parsed outputs remains flat across lambda, while parse validity changes sharply at the predicted boundary. The cliff diagnostic is rubric-independent, whereas the parity claim uses a Gemini-graded rubric and inherits that evaluator's exposure.
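To make the threshold concrete, here is a minimal sketch of the kind of calculation the abstract describes: extrapolating a single-position Bernoulli target past the teacher and bisecting for the lambda at which the fixed point exits a clip-safe region. The logit-space extrapolation form, the symmetric ratio-cap clip region, and all parameter values below are illustrative assumptions, not the paper's actual closed-form lambda*(p, b, c).

```python
# Illustrative sketch only: the extrapolation form and clip region are
# assumptions chosen for demonstration, not the paper's derivation.
import math

def logit(x: float) -> float:
    return math.log(x / (1.0 - x))

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def extrapolated_target(p: float, b: float, lam: float) -> float:
    """Assumed extrapolation form: move past the teacher in logit space.
    lam = 1 recovers the teacher probability p; lam > 1 overshoots it."""
    return sigmoid(logit(b) + lam * (logit(p) - logit(b)))

def clip_safe(q: float, b: float, c: float) -> bool:
    """Assumed base-relative clip-safe region: importance ratios of the
    target against the warm-start mass b stay inside [1/c, c] on both
    outcomes (the format-binding token and its complement)."""
    r_modal = q / b
    r_other = (1.0 - q) / (1.0 - b)
    return 1.0 / c <= r_modal <= c and 1.0 / c <= r_other <= c

def find_lambda_star(p: float, b: float, c: float,
                     lo: float = 1.0, hi: float = 50.0,
                     tol: float = 1e-6) -> float | None:
    """Bisect for the smallest lam at which the extrapolated fixed point
    exits the clip-safe region (the 'cliff')."""
    if not clip_safe(extrapolated_target(p, b, lo), b, c):
        return lo  # already unsafe at the teacher itself
    if clip_safe(extrapolated_target(p, b, hi), b, c):
        return None  # no cliff inside the scanned range
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if clip_safe(extrapolated_target(p, b, mid), b, c):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    # Made-up inputs: teacher modal prob p, warm-start mass b, clip strength c.
    p, b, c = 0.80, 0.60, 4.0
    lam_star = find_lambda_star(p, b, c)
    print(f"illustrative lambda* ~ {lam_star:.3f}")
    for lam in (1.0, 0.9 * lam_star, 1.1 * lam_star):
        q = extrapolated_target(p, b, lam)
        print(f"lam={lam:5.2f}  target={q:.4f}  clip_safe={clip_safe(q, b, c)}")
```

With these made-up inputs the scan reports a cliff near lambda ~ 1.83: just below it the extrapolated target stays clip-safe, while just above it the ratio on the non-modal outcome falls through the clip floor, which is the format-collapsing regime the abstract describes.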
Get this paper in your agent:
hf papers read 2605.08737
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes
This paper presents a comprehensive empirical study on on-policy distillation for large language models, identifying failure mechanisms like distribution mismatch and optimization instability, and proposing fixes such as stop-gradient objectives and RLVR-adapted teachers.
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
This paper investigates the parameter-level mechanisms behind the efficiency of On-Policy Distillation (OPD) for large language models, attributing it to early 'foresight' in module allocation and update direction. It proposes EffOPD, a plug-and-play method that accelerates OPD training by 3x without compromising final performance.
Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
This paper introduces a training-free diagnostic framework to analyze per-token distillation signals for reasoning models, revealing that guidance is more beneficial on incorrect rollouts and depends on student capacity and task context.
Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
This paper proposes an empirical 'sparse-to-dense' reward principle for language model post-training, arguing that scarce labeled data should be used with sparse rewards for teacher model discovery and dense rewards for student compression via distillation. The authors demonstrate that this staged approach, bridging sparse RL and on-policy distillation, outperforms direct GRPO on deployment-sized models in math benchmarks.
The Illusion of Certainty: Decoupling Capability and Calibration in On-Policy Distillation
This paper identifies that on-policy distillation (OPD) in language models leads to severe overconfidence due to information mismatch between training and deployment, and proposes CaOPD, a calibration-aware framework that improves both performance and confidence reliability.