Trust-Region Behavior Blending for On-Policy Distillation

Hugging Face Daily Papers 05/29/26, 12:00 AM Papers

Summary

Trust-Region behavior Blending (TRB) improves on-policy distillation by replacing poor early student rollouts with teacher-like behavior within a KL trust region during warmup, achieving stronger results on math-reasoning tasks.

On-policy distillation (OPD) trains a student on prefixes sampled from its own policy while matching a stronger teacher. This addresses the prefix mismatch of offline distillation, but early student rollouts can still be poor, placing teacher supervision on weak or low-quality prefixes. We propose Trust-Region behavior Blending (TRB), a warmup method that replaces the early rollout policy with the closest-to-teacher behavior policy inside a student-centered KL trust region, while keeping the per-prefix reverse-KL OPD loss unchanged. The KL budget is annealed to zero, so training returns to pure student rollouts after warmup. Across two math-reasoning distillation settings, TRB attains the strongest average among the compared methods.
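To make the "closest-to-teacher behavior policy inside a student-centered KL trust region" concrete, here is a minimal sketch for a single next-token distribution. It rests on assumptions the abstract does not confirm: closeness to the teacher is measured as KL(q || teacher), the trust region is KL(q || student) <= budget, and both distributions are strictly positive. Under those assumptions the constrained optimum is a normalized geometric mixture of student and teacher, so a bisection on the mixing weight finds it. The function names and the bisection are illustrative, not the paper's implementation.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def geometric_blend(student, teacher, lam):
    """Normalized geometric mixture, proportional to student^(1-lam) * teacher^lam."""
    weights = [s ** (1.0 - lam) * t ** lam for s, t in zip(student, teacher)]
    total = sum(weights)
    return [w / total for w in weights]

def trust_region_behavior(student, teacher, budget, iters=50):
    """Assumed form: blend nearest the teacher with KL(q || student) <= budget."""
    if budget <= 0.0:
        return list(student)   # zero budget: pure student rollouts
    if kl(teacher, student) <= budget:
        return list(teacher)   # teacher already lies inside the trust region
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        # KL(q_lam || student) grows with lam, so bisection is valid.
        if kl(geometric_blend(student, teacher, mid), student) <= budget:
            lo = mid
        else:
            hi = mid
    return geometric_blend(student, teacher, lo)

student = [0.7, 0.2, 0.1]
teacher = [0.1, 0.2, 0.7]
behavior = trust_region_behavior(student, teacher, budget=0.1)
```

With a budget of zero the function returns the student distribution unchanged, which matches the abstract's statement that training returns to pure student rollouts once the budget has been annealed away.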

Original Article


Cached at: 06/01/26, 11:20 AM


Source: https://huggingface.co/papers/2605.31159

Abstract

Trust-Region behavior Blending improves on-policy distillation by replacing early poor-quality student rollouts with teacher-like behavior within a KL trust region during warmup.

On-policy distillation (OPD) trains a student on prefixes sampled from its own policy while matching a stronger teacher. This addresses the prefix mismatch of offline distillation, but early student rollouts can still be poor, placing teacher supervision on weak or low-quality prefixes. We propose Trust-Region behavior Blending (TRB), a warmup method that replaces the early rollout policy with the closest-to-teacher behavior policy inside a student-centered KL trust region, while keeping the per-prefix reverse-KL OPD loss unchanged. The KL budget is annealed to zero, so training returns to pure student rollouts after warmup. Across two math-reasoning distillation settings, TRB attains the strongest average among the compared methods.
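The two remaining ingredients named in the abstract, the per-prefix reverse-KL loss and the annealed KL budget, can be sketched as follows. The reverse-KL direction, KL(student || teacher), is the usual OPD convention. The linear schedule and the names `kl_budget`, `warmup_steps`, and `initial_budget` are hypothetical: the abstract says only that the budget is annealed to zero, not how.

```python
import math

def reverse_kl(student, teacher):
    """Per-prefix reverse KL: KL(student || teacher) over next-token probabilities."""
    return sum(s * math.log(s / t) for s, t in zip(student, teacher) if s > 0.0)

def kl_budget(step, warmup_steps, initial_budget):
    """Hypothetical linear anneal of the KL budget to zero over the warmup."""
    if step >= warmup_steps:
        return 0.0  # after warmup: pure student rollouts
    return initial_budget * (1.0 - step / warmup_steps)

# The loss is computed the same way at every step; only the rollout
# policy that produced the prefix depends on the current budget.
loss = reverse_kl([0.5, 0.5], [0.9, 0.1])
budgets = [kl_budget(step, 100, 0.5) for step in (0, 50, 100)]
```

The point of the separation is that the budget changes only which prefixes the student is trained on, while the training objective on each prefix stays the reverse-KL OPD loss.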


Get this paper in your agent:

hf papers read 2605.31159

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash


Similar Articles

Trust Region On-Policy Distillation

OPRD: On-Policy Representation Distillation

Trust Region Q Adjoint Matching

Trust Region Inverse Reinforcement Learning: Explicit Dual Ascent using Local Policy Updates

On-Policy Distillation (5 minute read)
