reward-extrapolation

The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs

Hugging Face Daily Papers · 5d ago

This paper identifies a safety threshold for reward extrapolation in on-policy distillation: beyond it, models trained on structured-output tasks stop preserving the required output format. Empirical validation shows that staying below this threshold lets a 1.7B student model match an 8B SFT baseline on Amazon Fashion tasks with roughly one-fifth of the parameters.
