Crosslingual On-Policy Self-Distillation for Multilingual Reasoning
Summary
The paper proposes Crosslingual On-Policy Self-Distillation (COPSD), a method that transfers high-resource-language reasoning capabilities to low-resource languages by using the same model as both student and teacher. Experiments across 17 African languages show significant improvements in mathematical reasoning and answer-format adherence, outperforming Group Relative Policy Optimization (GRPO).
Source: https://huggingface.co/papers/2605.09548
Abstract
COPSD transfers high-resource language model reasoning behavior to low-resource languages using self-distillation with crosslingual context, improving mathematical reasoning performance.
Large language models (LLMs) have achieved remarkable progress in mathematical reasoning, but this ability is not equally accessible across languages. Low-resource languages in particular exhibit much lower reasoning performance. To address this, we propose Crosslingual On-Policy Self-Distillation (COPSD), which transfers a model's own high-resource reasoning behavior to low-resource languages. COPSD uses the same model as student and teacher: the student sees only the low-resource problem, while the teacher receives privileged crosslingual context, including the problem translation and reference solution in English. Training minimizes full-distribution token-level divergence on the student's own rollouts, providing dense supervision while avoiding the sparsity and instability of outcome-only reinforcement learning (RL). Experiments on 17 low-resource African languages show that COPSD consistently improves low-resource mathematical reasoning across model sizes and substantially outperforms Group Relative Policy Optimization (GRPO). Further analyses show that COPSD improves answer-format adherence, strengthens test-time scaling, and generalizes to harder multilingual reasoning benchmarks, with especially large gains for lower-resource languages. We make our code and data available at: https://github.com/cisnlp/COPSD.
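The training objective described in the abstract, a full-distribution token-level divergence between the teacher's and student's predictions over the student's own rollouts, can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: `teacher_logits` and `student_logits` stand for the same model's per-token outputs with and without the privileged crosslingual context, and the use of forward KL with mean reduction over the rollout is an assumed choice.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a vocabulary-sized list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def token_kl(teacher_logits, student_logits):
    # KL(teacher || student) over the full vocabulary distribution at one
    # token position -- "full-distribution" supervision, as opposed to a
    # scalar outcome reward.
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

def copsd_loss(teacher_rollout_logits, student_rollout_logits):
    # Mean token-level divergence over a rollout the student generated
    # itself (on-policy), giving a dense training signal at every token.
    kls = [token_kl(t, s)
           for t, s in zip(teacher_rollout_logits, student_rollout_logits)]
    return sum(kls) / len(kls)
```

When teacher and student agree at every token the loss is zero; any disagreement in the predicted distributions yields a positive, differentiable penalty, which is what distinguishes this dense signal from the sparse correct/incorrect rewards of outcome-only RL.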
Similar Articles
Reasoning Compression with Mixed-Policy Distillation
This paper proposes Mixed-Policy Distillation (MPD), a framework that transfers concise reasoning behaviors from large teacher models to smaller student models, reducing token usage by up to 27.1% while improving performance.
Less Languages, Less Tokens: An Efficient Unified Logic Cross-lingual Chain-of-Thought Reasoning Framework
UL-XCoT introduces a unified logic space to prune low-quality multilingual reasoning paths, cutting >50% token cost while improving accuracy and robustness on low-resource languages.
Think Multilingual, Not Harder: A Data-Efficient Framework for Teaching Reasoning Models to Code-Switch
This paper introduces a data-efficient fine-tuning framework for teaching reasoning models to code-switch (mix languages) effectively, demonstrating that strategic code-switching can improve reasoning capabilities for lower-resource languages. The work analyzes code-switching behaviors in large language models across diverse languages, tasks, and domains, then develops interventions to promote beneficial code-switching patterns.
Rubric-based On-policy Distillation
This paper introduces ROPD, a rubric-based on-policy distillation framework that achieves superior sample efficiency compared to traditional logit-based methods. It enables model alignment in black-box scenarios by using structured semantic rubrics instead of teacher logits.
Large Language Models for Math Education in Low-Resource Languages: A Study in Sinhala and Tamil
This paper evaluates the mathematical reasoning capabilities of large language models in Sinhala and Tamil, two low-resource South Asian languages, using a parallel dataset of independently authored problems. The study demonstrates that while basic arithmetic transfers well across languages, complex reasoning tasks show significant performance degradation in non-English languages, with implications for deploying AI tutoring tools in multilingual educational contexts.