The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs

Hugging Face Daily Papers

Summary

This paper identifies a safety threshold in on-policy distillation with reward extrapolation, beyond which structured output tasks lose format preservation. Empirical validation shows that operating below this threshold allows a 1.7B student model to match an 8B SFT baseline on Amazon Fashion tasks with one-fifth the parameters.

On-policy distillation (OPD) is widely used for LLM post-training. When pushed with a reward-extrapolation coefficient lambda > 1, the student can lift past the teacher in domain, but past a threshold lambda* the same step violates the output contract on structured-output tasks. In a single-position Bernoulli reduction, we derive a closed-form base-relative clip-safety threshold lambda*(p,b,c) determined by three measurable quantities: the teacher modal probability, the warm-start mass, and the importance-sampling clip strength. Above lambda*, the extrapolated fixed point exits the clip-safe region, changing training from format-preserving to format-collapsing. We extend the rule to calibrated K-ary listwise JSON tasks where a single binding equivalence class dominates the output contract and SFT retains parse headroom. On Amazon Fashion, three pre-registered tests--a fine-grid cliff interval, a budget-extension test, and a small-clip cross-prediction--fall within their locked prediction windows, with the small-clip value matching the closed-form prediction below grid resolution. Operating just below lambda*, ListOPD brings a 1.7B Qwen3 student to in-domain parity with an 8B-SFT baseline at one-fifth the parameters. The gain is driven primarily by format adherence: NDCG@1 on parsed outputs remains flat across lambda, while parse validity sharply changes at the predicted boundary. The cliff diagnostic is rubric-independent, whereas the parity claim uses a Gemini-graded rubric and inherits that evaluator's exposure.
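The abstract does not reproduce the closed form of lambda*(p, b, c), so the sketch below is only an illustration of the mechanism, not the paper's exact rule. It assumes logit-space extrapolation from the warm-start (base) mass b toward the teacher modal probability p, and a PPO-style symmetric clip of strength c on the modal token's importance ratio; the function names and the functional form are assumptions for illustration only.

```python
import math

def extrapolate(p_base: float, p_teacher: float, lam: float) -> float:
    """Logit-space extrapolation from the base policy toward (and, for
    lam > 1, past) the teacher, at a single Bernoulli position."""
    logit = lambda q: math.log(q / (1.0 - q))
    z = logit(p_base) + lam * (logit(p_teacher) - logit(p_base))
    return 1.0 / (1.0 + math.exp(-z))

def lambda_star(p_teacher: float, p_base: float, clip: float) -> float:
    """Largest lambda that keeps the modal token's importance ratio
    q(lambda) / p_base within a PPO-style clip bound (1 + clip).
    Assumes p_teacher != p_base; solved in closed form under the
    logit-extrapolation assumption above."""
    logit = lambda q: math.log(q / (1.0 - q))
    # Modal mass at which the ratio hits the clip boundary (capped below 1).
    q_max = min((1.0 + clip) * p_base, 1.0 - 1e-12)
    return (logit(q_max) - logit(p_base)) / (logit(p_teacher) - logit(p_base))
```

Under these illustrative assumptions, b = 0.6, p = 0.7, c = 0.3 gives lambda* ≈ 1.95: modest extrapolation past the teacher stays clip-safe, matching the qualitative picture of a cliff at some lambda* > 1.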

Paper page - The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs

Source: https://huggingface.co/papers/2605.08737 Published on May 9

Submitted by XinLi (https://huggingface.co/XINLI1997) on May 14

Abstract

On-policy distillation with reward extrapolation exhibits a safety threshold beyond which structured output tasks lose format preservation, with empirical validation showing performance parity at reduced parameter count when operating below this threshold.

On-policy distillation (OPD) is widely used for LLM post-training. When pushed with a reward-extrapolation coefficient lambda > 1, the student can lift past the teacher in domain, but past a threshold lambda* the same step violates the output contract on structured-output tasks. In a single-position Bernoulli reduction, we derive a closed-form base-relative clip-safety threshold lambda*(p,b,c) determined by three measurable quantities: the teacher modal probability, the warm-start mass, and the importance-sampling clip strength. Above lambda*, the extrapolated fixed point exits the clip-safe region, changing training from format-preserving to format-collapsing. We extend the rule to calibrated K-ary listwise JSON tasks where a single binding equivalence class dominates the output contract and SFT retains parse headroom. On Amazon Fashion, three pre-registered tests--a fine-grid cliff interval, a budget-extension test, and a small-clip cross-prediction--fall within their locked prediction windows, with the small-clip value matching the closed-form prediction below grid resolution. Operating just below lambda*, ListOPD brings a 1.7B Qwen3 student to in-domain parity with an 8B-SFT baseline at one-fifth the parameters. The gain is driven primarily by format adherence: NDCG@1 on parsed outputs remains flat across lambda, while parse validity sharply changes at the predicted boundary. The cliff diagnostic is rubric-independent, whereas the parity claim uses a Gemini-graded rubric and inherits that evaluator's exposure.


Get this paper in your agent:

hf papers read 2605.08737

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash


Similar Articles

The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes

Hugging Face Daily Papers

This paper presents a comprehensive empirical study on on-policy distillation for large language models, identifying failure mechanisms like distribution mismatch and optimization instability, and proposing fixes such as stop-gradient objectives and RLVR-adapted teachers.

Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

arXiv cs.CL

This paper investigates the parameter-level mechanisms behind the efficiency of On-Policy Distillation (OPD) for large language models, attributing it to early 'foresight' in module allocation and update direction. It proposes EffOPD, a plug-and-play method that accelerates OPD training by 3x without compromising final performance.

Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

Hugging Face Daily Papers

This paper proposes an empirical 'sparse-to-dense' reward principle for language model post-training, arguing that scarce labeled data should be used with sparse rewards for teacher model discovery and dense rewards for student compression via distillation. The authors demonstrate that this staged approach, bridging sparse RL and on-policy distillation, outperforms direct GRPO on deployment-sized models in math benchmarks.