Transferability for General Reasoning: An Automated Curriculum for Multi-Domain RLVR

Hugging Face Daily Papers 06/27/26, 12:00 AM Papers

Summary

This paper proposes Transfer-Aware Curriculum (TAC), a bandit-style online curriculum for multi-domain RLVR that prioritizes domains whose updates benefit other domains using gradient-geometry alignment. TAC improves macro-averaged accuracy on Qwen3-1.7B and Llama3.2-3B over fixed and learnability-only curricula.

Reinforcement learning with verifiable rewards (RLVR) has been extended from single-domain training to multi-domain reasoning suites spanning mathematics, programming, and science. However, the training curriculum (how often each domain is sampled) is typically fixed or hand-tuned, even though reasoning skills transfer unevenly across domains. Existing learnability-based curricula adapt to where the policy is currently improving, but are blind to whether a gradient step on the selected domain benefits the remaining domains. In this paper, we propose Transfer-Aware Curriculum (TAC), a bandit-style online curriculum that prioritizes domains whose updates broadly benefit the rest of the training suite. TAC repurposes signals already produced by RL training: per-domain advantages capture local learnability, and projected gradients, taken from the GRPO step being computed, estimate cross-domain transferability via gradient-geometry alignment, at negligible cost (<1% wall-clock overhead). Across a six-domain reasoning suite, TAC achieves the best macro-averaged accuracy on both Qwen3-1.7B and Llama3.2-3B, outperforming proportional random sampling, a hand-designed schedule, and a learnability-only bandit, and improving over the last of these by up to 2.8 points (10% relative). Ablations show performance degrades sharply when the transferability term is removed, and TAC remains robust on imbalanced training mixtures where learnability-only curricula over-commit to dominant domains. Our findings establish cross-domain transferability as a key signal for curriculum design in multi-domain RLVR.
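The selection loop described above can be pictured with a minimal sketch. The abstract gives no formulas, so everything specific here is an assumption for illustration: transferability is taken as the mean cosine similarity between one domain's projected gradient and every other domain's, it is added to a learnability score with a hypothetical weight `lam`, and domains are sampled from a softmax mixed with uniform exploration. The function names, the toy gradients, and the parameters `lam`, `temperature`, and `explore` are all hypothetical and are not taken from the paper.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def transferability(grads):
    """Assumed proxy: mean gradient alignment of each domain with all others."""
    names = list(grads)
    return {
        k: sum(cosine(grads[k], grads[j]) for j in names if j != k) / (len(names) - 1)
        for k in names
    }


def sampling_probs(learnability, transfer, lam=1.0, temperature=0.5, explore=0.1):
    """Softmax over (learnability + lam * transfer), mixed with uniform exploration."""
    names = list(learnability)
    scores = [(learnability[k] + lam * transfer[k]) / temperature for k in names]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    uniform = 1.0 / len(names)
    return {k: (1 - explore) * e / total + explore * uniform for k, e in zip(names, exps)}


# Toy 2-D "projected gradients" (hypothetical values, three domains for brevity).
grads = {
    "math": [1.0, 0.0],
    "code": [0.8, 0.6],
    "science": [-1.0, 0.0],
}
# Equal learnability, so only the transfer term separates the domains.
learnability = {"math": 0.3, "code": 0.3, "science": 0.3}

transfer = transferability(grads)
probs = sampling_probs(learnability, transfer)
next_domain = random.choices(list(probs), weights=list(probs.values()), k=1)[0]
```

In this toy setup the three domains are equally learnable, so a learnability-only rule would sample them uniformly. The alignment term instead favors `code`, whose gradient conflicts least with the others, and down-weights `science`, whose gradient opposes both.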

Original Article

Cached at: 07/03/26, 07:53 AM

Source: https://huggingface.co/papers/2606.25178

Abstract

Transfer-Aware Curriculum (TAC) improves multi-domain reinforcement learning by prioritizing domains that provide broad benefits to other domains, using gradient-geometry alignment to estimate cross-domain transferability.

Reinforcement learning with verifiable rewards (RLVR) has been extended from single-domain training to multi-domain reasoning suites spanning mathematics, programming, and science. However, the training curriculum (how often each domain is sampled) is typically fixed or hand-tuned, even though reasoning skills transfer unevenly across domains. Existing learnability-based curricula adapt to where the policy is currently improving, but are blind to whether a gradient step on the selected domain benefits the remaining domains. In this paper, we propose Transfer-Aware Curriculum (TAC), a bandit-style online curriculum that prioritizes domains whose updates broadly benefit the rest of the training suite. TAC repurposes signals already produced by RL training: per-domain advantages capture local learnability, and projected gradients, taken from the GRPO step being computed, estimate cross-domain transferability via gradient-geometry alignment, at negligible cost (<1% wall-clock overhead). Across a six-domain reasoning suite, TAC achieves the best macro-averaged accuracy on both Qwen3-1.7B and Llama3.2-3B, outperforming proportional random sampling, a hand-designed schedule, and a learnability-only bandit, and improving over the last of these by up to 2.8 points (10% relative). Ablations show performance degrades sharply when the transferability term is removed, and TAC remains robust on imbalanced training mixtures where learnability-only curricula over-commit to dominant domains. Our findings establish cross-domain transferability as a key signal for curriculum design in multi-domain RLVR.
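Because the headline results are reported as macro-averaged accuracy, it may help to see how that metric differs from a pooled (micro) average when domains are imbalanced. The sketch below uses the standard definition of a macro average, the unweighted mean of per-domain accuracies. The domain names and counts are hypothetical and are not results from the paper, and the paper's exact evaluation protocol is not given in the abstract.

```python
def macro_accuracy(correct, total):
    """Unweighted mean of per-domain accuracies: every domain counts equally."""
    return sum(correct[d] / total[d] for d in total) / len(total)


def micro_accuracy(correct, total):
    """Pooled accuracy over all examples: large domains dominate."""
    return sum(correct.values()) / sum(total.values())


# Hypothetical, deliberately imbalanced evaluation counts.
total = {"math": 900, "code": 50, "science": 50}
correct = {"math": 450, "code": 10, "science": 20}

macro = macro_accuracy(correct, total)  # mean of 0.5, 0.2, 0.4
micro = micro_accuracy(correct, total)  # 480 correct out of 1000
```

With these counts the pooled score is pulled toward the large `math` domain, while the macro average weights the two small domains equally. A curriculum that over-commits to a dominant domain is therefore penalized under macro averaging.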

Similar Articles

What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA

Multi-Turn Reasoning When Context Arrives in Pieces: Scalable Sharding and Memory-Augmented RL

Stratagem: Learning Transferable Reasoning via Trajectory-Modulated Game Self-Play

Tandem Reinforcement Learning with Verifiable Rewards

GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero
