Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding
Summary
CoRD is a collaborative multi-teacher decoding framework that synthesizes reasoning trajectories through predictive perplexity scoring and beam search, enabling efficient distillation of large reasoning models with high-quality outputs and generalized performance.
Cached at: 05/18/26, 10:25 AM
Source: https://huggingface.co/papers/2605.02290
Abstract
Distilling large reasoning models is essential for making Long-CoT reasoning practical, as full-scale inference remains computationally prohibitive. Existing curation-based approaches select complete reasoning traces post-hoc, overlooking collaboration among heterogeneous teachers and lacking dynamic exploration, which leads to redundant sampling and missed complementary reasoning. We introduce CoRD, a collaborative multi-teacher decoding framework that performs step-wise reasoning synthesis guided by predictive perplexity-based scoring and beam search. This enables heterogeneous LRMs to jointly construct coherent reasoning trajectories while efficiently preserving diverse, high-potential hypotheses. Experiments show that CoRD produces higher-quality reasoning data and achieves near teacher-level student performance with fewer, structured supervision signals, without substantial efficiency overhead. CoRD further generalizes well to out-of-domain and open-ended settings. The dataset and model are available at https://github.com/DISL-Lab/CoRD.
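The step-wise idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not define "predictive perplexity", so the sketch substitutes ordinary perplexity over the accumulated token log-probabilities. The `Teacher` interface, the two toy teachers, and the hard-coded log-probabilities are all hypothetical stand-ins for real reasoning models.

```python
import math
from typing import Callable, List, Tuple

# A candidate step: (step text, per-token log-probabilities for that step).
Step = Tuple[str, List[float]]
# Hypothetical teacher interface: given the steps so far, propose next steps.
Teacher = Callable[[List[str]], List[Step]]


def perplexity(logprobs: List[float]) -> float:
    """Standard perplexity: exp of the negative mean token log-probability."""
    return math.exp(-sum(logprobs) / len(logprobs))


def collaborative_beam_search(
    teachers: List[Teacher], beam_width: int, max_steps: int
) -> List[Tuple[List[str], float]]:
    """Grow trajectories one step at a time, pooling proposals from all teachers."""
    # Each beam entry holds (steps so far, all token log-probs so far).
    beams: List[Tuple[List[str], List[float]]] = [([], [])]
    for _ in range(max_steps):
        candidates = []
        for steps, lps in beams:
            for teacher in teachers:
                for text, step_lps in teacher(steps):
                    candidates.append((steps + [text], lps + step_lps))
        if not candidates:
            break
        # Assumption: lower perplexity means a more promising partial trajectory.
        candidates.sort(key=lambda c: perplexity(c[1]))
        beams = candidates[:beam_width]
    return [(steps, perplexity(lps)) for steps, lps in beams]


def teacher_a(steps: List[str]) -> List[Step]:
    # Toy teacher: confident on the first step, unsure afterwards.
    if not steps:
        return [("A: set up equation", [-0.1, -0.1])]
    return [("A: guess answer", [-2.0, -2.0])]


def teacher_b(steps: List[str]) -> List[Step]:
    # Toy teacher: unsure on the first step, confident afterwards.
    if not steps:
        return [("B: restate problem", [-1.0, -1.0])]
    return [("B: solve equation", [-0.2, -0.2])]


results = collaborative_beam_search([teacher_a, teacher_b], beam_width=2, max_steps=2)
for steps, ppl in results:
    print(f"{ppl:.3f}  {steps}")
```

In this toy run the lowest-perplexity trajectory takes its first step from one teacher and its second from the other, which is the kind of complementary reasoning that selecting whole traces post-hoc cannot produce. A beam width above 1 keeps alternative partial trajectories alive instead of committing greedily to a single step.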
Similar Articles
Long-Context Reasoning Through Proxy-Based Chain-of-Thought Tuning
Proposes ProxyCoT, a training framework that improves long-context reasoning in large language models by first obtaining chain-of-thought reasoning traces on short proxy contexts (via reinforcement learning or distillation) and then grounding them in full long contexts through supervised fine-tuning. Experiments show consistent improvements over baselines with reduced computational cost.
Improving Reasoning Capabilities in Small Models through Mixture-of-Layers Distillation with Stepwise Attention on Key Information
This paper proposes a novel Chain-of-Thought distillation framework that transfers teacher models' stepwise attention on key information to student models through a Mixture-of-Layers module for dynamic layer alignment. The method achieves consistent performance improvements on mathematical and commonsense reasoning benchmarks by explicitly guiding student models to progressively focus on critical information during reasoning.
LoRi: Low-Rank Distillation for Implicit Reasoning
LoRi proposes a low-rank distillation framework for implicit chain-of-thought reasoning that aligns teacher and student trajectories in a shared low-rank subspace, improving performance on mathematical reasoning benchmarks.
COFT: Counterfactual-Conformal Decoding for Fair Chain-of-Thought Reasoning in Large Language Models
COFT is a training-free decoding method that applies token-level fairness control and conformal calibration to reduce bias in chain-of-thought reasoning of large language models, achieving 30-55% bias reduction with minimal computational overhead.
CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning
Contrastive Reflection (CORE) is a non-parametric algorithm that generates concise, interpretable insights from comparing successful and unsuccessful reasoning traces, enabling faster and more efficient self-improvement for language models with fewer samples and rollouts than existing methods.