Crosslingual On-Policy Self-Distillation for Multilingual Reasoning
Summary
The paper proposes Crosslingual On-Policy Self-Distillation (COPSD), a method that transfers high-resource-language reasoning capabilities to low-resource languages by using the same model as both student and teacher. Experiments across 17 African languages show significant improvements in mathematical reasoning and answer-format adherence, outperforming Group Relative Policy Optimization (GRPO).
Source: https://huggingface.co/papers/2605.09548
Abstract
COPSD transfers high-resource language model reasoning behavior to low-resource languages using self-distillation with crosslingual context, improving mathematical reasoning performance.
Large language models (LLMs) have achieved remarkable progress in mathematical reasoning, but this ability is not equally accessible across languages. Low-resource languages in particular exhibit much lower reasoning performance. To address this, we propose Crosslingual On-Policy Self-Distillation (COPSD), which transfers a model's own high-resource reasoning behavior to low-resource languages. COPSD uses the same model as student and teacher: the student sees only the low-resource problem, while the teacher receives privileged crosslingual context, including the problem translation and reference solution in English. Training minimizes full-distribution token-level divergence on the student's own rollouts, providing dense supervision while avoiding the sparsity and instability of outcome-only reinforcement learning (RL). Experiments on 17 low-resource African languages show that COPSD consistently improves low-resource mathematical reasoning across model sizes and substantially outperforms Group Relative Policy Optimization (GRPO). Further analyses show that COPSD improves answer-format adherence, strengthens test-time scaling, and generalizes to harder multilingual reasoning benchmarks, with especially large gains for lower-resource languages. We make our code and data available at: https://github.com/cisnlp/COPSD.
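The training objective described in the abstract, a full-distribution token-level divergence between the teacher's and student's predictions over the student's own rollouts, can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: `teacher_logits` and `student_logits` stand for the same model's per-token outputs with and without the privileged crosslingual context, and the use of forward KL with mean reduction over the rollout is an assumed choice.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a vocabulary-sized list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def token_kl(teacher_logits, student_logits):
    # KL(teacher || student) over the full vocabulary distribution at one
    # token position -- "full-distribution" supervision, as opposed to a
    # scalar outcome reward.
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

def copsd_loss(teacher_rollout_logits, student_rollout_logits):
    # Mean token-level divergence over a rollout the student generated
    # itself (on-policy), giving a dense training signal at every token.
    kls = [token_kl(t, s)
           for t, s in zip(teacher_rollout_logits, student_rollout_logits)]
    return sum(kls) / len(kls)
```

When teacher and student agree at every token the loss is zero; any disagreement in the predicted distributions yields a positive, differentiable penalty, which is what distinguishes this dense signal from the sparse correct/incorrect rewards of outcome-only RL.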
Similar Articles
Reasoning Compression with Mixed-Policy Distillation
This paper proposes Mixed-Policy Distillation (MPD), a framework that transfers concise reasoning behaviors from large teacher models to smaller student models, reducing token usage by up to 27.1% while improving performance.
Less Languages, Less Tokens: An Efficient Unified Logic Cross-lingual Chain-of-Thought Reasoning Framework
UL-XCoT introduces a unified logic space to prune low-quality multilingual reasoning paths, cutting >50% token cost while improving accuracy and robustness on low-resource languages.
Think Multilingual, Not Harder: A Data-Efficient Framework for Teaching Reasoning Models to Code-Switch
This paper introduces a data-efficient fine-tuning framework for teaching reasoning models to code-switch (mix languages) effectively, demonstrating that strategic code-switching can improve reasoning capabilities for lower-resource languages. The work analyzes code-switching behaviors in large language models across diverse languages, tasks, and domains, then develops interventions to promote beneficial code-switching patterns.
Rubric-based On-policy Distillation
This paper introduces ROPD, a rubric-based on-policy distillation framework that achieves superior sample efficiency compared to traditional logit-based methods. It enables model alignment in black-box scenarios by using structured semantic rubrics instead of teacher logits.
Large Language Models for Math Education in Low-Resource Languages: A Study in Sinhala and Tamil
This paper evaluates the mathematical reasoning capabilities of large language models in Sinhala and Tamil, two low-resource South Asian languages, using a parallel dataset of independently authored problems. The study demonstrates that while basic arithmetic transfers well across languages, complex reasoning tasks show significant performance degradation in non-English languages, with implications for deploying AI tutoring tools in multilingual educational contexts.