Tag
This paper identifies a safety threshold in on-policy distillation with reward extrapolation, beyond which structured output tasks lose format preservation. Empirical validation shows that operating below this threshold allows a 1.7B student model to match an 8B SFT baseline on Amazon Fashion tasks with one-fifth the parameters.