Trust-Region Behavior Blending for On-Policy Distillation
Summary
Trust-Region Behavior Blending (TRB) improves on-policy distillation by replacing poor early student rollouts with teacher-like behavior within a KL trust region during warmup, attaining the strongest average among the compared methods across two math-reasoning distillation settings.
Cached at: 06/01/26, 11:20 AM
Source: https://huggingface.co/papers/2605.31159
Abstract
Trust-Region Behavior Blending improves on-policy distillation by replacing early poor-quality student rollouts with teacher-like behavior within a KL trust region during warmup.
On-policy distillation (OPD) trains a student on prefixes sampled from its own policy while matching a stronger teacher. This addresses the prefix mismatch of offline distillation, but early student rollouts can still be poor, placing teacher supervision on weak or low-quality prefixes. We propose Trust-Region Behavior Blending (TRB), a warmup method that replaces the early rollout policy with the closest-to-teacher behavior policy inside a student-centered KL trust region, while keeping the per-prefix reverse-KL OPD loss unchanged. The KL budget is annealed to zero, so training returns to pure student rollouts after warmup. Across two math-reasoning distillation settings, TRB attains the strongest average among the compared methods.
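The abstract does not spell out how the blended behavior policy is computed, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that "closest to teacher" means minimizing KL(b || teacher) and that the trust region is KL(b || student) <= eps, both taken per prefix over the next-token distribution. Under those assumptions the constrained optimum is a normalized geometric mixture of student and teacher, and the mixing weight can be found by bisection. The function names, the linear annealing schedule, and the toy distributions are all hypothetical.

```python
import numpy as np


def kl(p, q):
    """KL(p || q) for two strictly positive categorical distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))


def geometric_blend(student, teacher, lam):
    """Normalized geometric mixture student^(1-lam) * teacher^lam."""
    logits = (1.0 - lam) * np.log(student) + lam * np.log(teacher)
    logits -= logits.max()  # shift for numerical stability before exponentiating
    weights = np.exp(logits)
    return weights / weights.sum()


def blend_within_trust_region(student, teacher, eps, iters=50):
    """Behavior distribution closest to the teacher with KL(b || student) <= eps.

    Assumption (not from the paper): both the closeness measure and the trust
    region use KL with the behavior policy as the first argument.
    """
    if eps <= 0.0:
        return student.copy()  # zero budget: pure student rollouts
    if kl(teacher, student) <= eps:
        return teacher.copy()  # the teacher itself already fits the budget
    lo, hi = 0.0, 1.0
    # KL(blend || student) is non-decreasing in lam on [0, 1], so bisection works.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(geometric_blend(student, teacher, mid), student) <= eps:
            lo = mid
        else:
            hi = mid
    return geometric_blend(student, teacher, lo)


def kl_budget(step, warmup_steps, eps0):
    """Hypothetical linear anneal of the KL budget from eps0 down to zero."""
    return eps0 * max(0.0, 1.0 - step / warmup_steps)


# Toy next-token distributions over a three-token vocabulary (made up).
student = np.array([0.7, 0.2, 0.1])
teacher = np.array([0.1, 0.2, 0.7])
behavior = blend_within_trust_region(student, teacher, kl_budget(0, 100, 0.05))
```

In this sketch only the distribution used to sample rollout tokens changes. The training loss would still be the reverse KL between student and teacher on each sampled prefix, and once `kl_budget` reaches zero the sampler returns the student distribution unchanged.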
Similar Articles
Trust Region On-Policy Distillation
The paper proposes Trust Region On-Policy Distillation (TrOPD) to stabilize on-policy distillation of large language models by using trust regions, outlier estimation, and off-policy guidance, outperforming existing methods on reasoning and code generation benchmarks.
Trust-Region Diffusion Policies for Massively Parallel On-Policy RL
Introduces TruDi, a method that enables training diffusion policies in massively parallel on-policy reinforcement learning by using a trust-region optimization rule to enforce KL constraints, achieving strong performance across 73 tasks.
OPRD: On-Policy Representation Distillation
OPRD proposes a new knowledge distillation method that aligns student and teacher hidden states across layers during on-policy rollouts, eliminating sampling variance from token-space KL estimation. Empirically, OPRD outperforms output-space baselines on math reasoning benchmarks (AIME 2024/2025, AIMO) while being 1.44x faster and using 54% less memory.
Trust Region Q Adjoint Matching
Trust Region Q-Adjoint Matching (TRQAM) addresses instability in off-policy reinforcement learning by adaptively controlling path-space KL divergence through projected dual descent, enabling stable fine-tuning of pretrained flow policies. The method consistently outperforms prior methods on 50 OGBench tasks, achieving a 68% success rate in offline RL compared to the strongest baseline's 46%.
Trust Region Inverse Reinforcement Learning: Explicit Dual Ascent using Local Policy Updates
This paper introduces Trust Region Inverse Reinforcement Learning (TRIRL), a method that combines monotonic dual improvement with efficient local policy updates to outperform state-of-the-art imitation learning methods. It addresses the trade-off between stability and computational cost in IRL by using trust-region constraints.