reward-extrapolation

The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs

Hugging Face Daily Papers · 5d ago

This paper identifies a safety threshold for reward extrapolation in on-policy distillation: beyond it, the student's outputs on structured-output tasks no longer preserve the required format. Empirically, operating below the threshold lets a 1.7B student model match an 8B SFT baseline on Amazon Fashion tasks with roughly one-fifth the parameters.
