Crosslingual On-Policy Self-Distillation for Multilingual Reasoning

Hugging Face Daily Papers

Summary

The paper proposes Crosslingual On-Policy Self-Distillation (COPSD), a method to transfer high-resource language reasoning capabilities to low-resource languages using a shared student-teacher architecture. Experiments across 17 African languages show significant improvements in mathematical reasoning and answer-format adherence, outperforming Group Relative Policy Optimization (GRPO).

Large language models (LLMs) have achieved remarkable progress in mathematical reasoning, but this ability is not equally accessible across languages; low-resource languages, in particular, exhibit much lower reasoning performance. To address this, we propose Crosslingual On-Policy Self-Distillation (COPSD), which transfers a model's own high-resource reasoning behavior to low-resource languages. COPSD uses the same model as student and teacher: the student sees only the low-resource problem, while the teacher receives privileged crosslingual context, including the problem translation and reference solution in English. Training minimizes full-distribution token-level divergence on the student's own rollouts, providing dense supervision while avoiding the sparsity and instability of outcome-only reinforcement learning (RL). Experiments on 17 low-resource African languages show that COPSD consistently improves low-resource mathematical reasoning across model sizes and substantially outperforms Group Relative Policy Optimization (GRPO). Further analyses show that COPSD improves answer-format adherence, strengthens test-time scaling, and generalizes to harder multilingual reasoning benchmarks, with especially large gains for lower-resource languages. We make our code and data available at: https://github.com/cisnlp/COPSD.
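The training objective described above can be sketched in a few lines. This is a minimal, framework-free illustration (not the authors' implementation): `copsd_loss` is a hypothetical helper name, the logits are toy vectors, and the forward KL direction and mean reduction are assumptions, since the abstract only specifies a "full-distribution token-level divergence" computed on the student's own rollouts.

```python
import math

def softmax(logits):
    # Convert a logit vector over the vocabulary into probabilities.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def token_kl(p_teacher, p_student, eps=1e-12):
    # KL(teacher || student) over one token position's vocabulary.
    return sum(p * math.log((p + eps) / (q + eps))
               for p, q in zip(p_teacher, p_student))

def copsd_loss(teacher_logits, student_logits):
    """Mean token-level full-distribution divergence over one rollout.

    Both lists hold one logit vector per token of the *student's own*
    rollout (on-policy). The teacher shares weights with the student
    but conditions on privileged crosslingual context (the English
    translation and reference solution), so its logits differ even
    though it is the same model.
    """
    kls = [token_kl(softmax(t), softmax(s))
           for t, s in zip(teacher_logits, student_logits)]
    return sum(kls) / len(kls)
```

Because the divergence is computed at every token rather than only on the final answer, the gradient signal is dense, which is the contrast the abstract draws with outcome-only RL objectives such as GRPO.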

Cached at: 05/12/26, 10:51 AM


Source: https://huggingface.co/papers/2605.09548

Abstract

COPSD transfers high-resource language model reasoning behavior to low-resource languages using self-distillation with crosslingual context, improving mathematical reasoning performance.


Get this paper in your agent:

hf papers read 2605.09548

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash


Similar Articles

Reasoning Compression with Mixed-Policy Distillation

arXiv cs.AI

This paper proposes Mixed-Policy Distillation (MPD), a framework that transfers concise reasoning behaviors from large teacher models to smaller student models, reducing token usage by up to 27.1% while improving performance.

Think Multilingual, Not Harder: A Data-Efficient Framework for Teaching Reasoning Models to Code-Switch

arXiv cs.CL

This paper introduces a data-efficient fine-tuning framework for teaching reasoning models to code-switch (mix languages) effectively, demonstrating that strategic code-switching can improve reasoning capabilities for lower-resource languages. The work analyzes code-switching behaviors in large language models across diverse languages, tasks, and domains, then develops interventions to promote beneficial code-switching patterns.

Rubric-based On-policy Distillation

Hugging Face Daily Papers

This paper introduces ROPD, a rubric-based on-policy distillation framework that achieves superior sample efficiency compared to traditional logit-based methods. It enables model alignment in black-box scenarios by using structured semantic rubrics instead of teacher logits.

Large Language Models for Math Education in Low-Resource Languages: A Study in Sinhala and Tamil

arXiv cs.CL

This paper evaluates the mathematical reasoning capabilities of large language models in Sinhala and Tamil, two low-resource South Asian languages, using a parallel dataset of independently authored problems. The study demonstrates that while basic arithmetic transfers well across languages, complex reasoning tasks show significant performance degradation in non-English languages, with implications for deploying AI tutoring tools in multilingual educational contexts.