Transferability for General Reasoning: An Automated Curriculum for Multi-Domain RLVR
Summary
This paper proposes Transfer-Aware Curriculum (TAC), a bandit-style online curriculum for multi-domain RLVR that prioritizes domains whose updates benefit other domains using gradient-geometry alignment. TAC improves macro-averaged accuracy on Qwen3-1.7B and Llama3.2-3B over fixed and learnability-only curricula.
Source: https://huggingface.co/papers/2606.25178
Abstract
Transfer-Aware Curriculum (TAC) improves multi-domain reinforcement learning by prioritizing domains that provide broad benefits to other domains, using gradient-geometry alignment to estimate cross-domain transferability.
Reinforcement learning with verifiable rewards (RLVR) has been extended from single-domain training to multi-domain reasoning suites spanning mathematics, programming, and science. However, the training curriculum (how often each domain is sampled) is typically fixed or hand-tuned, even though reasoning skills transfer unevenly across domains. Existing learnability-based curricula adapt to where the policy is currently improving, but are blind to whether a gradient step on the selected domain benefits the remaining domains. In this paper, we propose Transfer-Aware Curriculum (TAC), a bandit-style online curriculum that prioritizes domains whose updates broadly benefit the rest of the training suite. TAC repurposes signals already produced by RL training: per-domain advantages capture local learnability, and projected gradients, taken from the GRPO step being computed, estimate cross-domain transferability via gradient-geometry alignment, at negligible cost (<1% wall-clock overhead). Across a six-domain reasoning suite, TAC achieves the best macro-averaged accuracy on both Qwen3-1.7B and Llama3.2-3B, outperforming proportional random sampling, a hand-designed schedule, and a learnability-only bandit, and improving over the last of these by up to 2.8 points (10% relative). Ablations show performance degrades sharply when the transferability term is removed, and TAC remains robust on imbalanced training mixtures where learnability-only curricula over-commit to dominant domains. Our findings establish cross-domain transferability as a key signal for curriculum design in multi-domain RLVR.
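The abstract describes the curriculum only at a high level, so the following is a minimal sketch of the general idea and not the paper's actual algorithm. It assumes that transferability is the mean cosine similarity between one domain's projected gradient and those of the other domains, that learnability is a single scalar per domain, and that the two are combined additively and turned into sampling probabilities with a softmax. The function names, `transfer_weight`, `temperature`, and all numbers are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def transferability(grads):
    """Assumed proxy: mean cosine similarity of each domain's gradient
    with the gradients of all other domains."""
    out = {}
    for i in grads:
        sims = [cosine(grads[i], grads[j]) for j in grads if j != i]
        out[i] = sum(sims) / len(sims) if sims else 0.0
    return out

def sampling_probs(learnability, grads, transfer_weight=1.0, temperature=0.5):
    """Softmax over (learnability + transfer_weight * transferability).
    The additive combination is an illustrative assumption."""
    transfer = transferability(grads)
    scores = {d: learnability[d] + transfer_weight * transfer[d] for d in grads}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {d: math.exp((s - top) / temperature) for d, s in scores.items()}
    total = sum(exps.values())
    return {d: e / total for d, e in exps.items()}

# Toy inputs (made up): low-dimensional projected gradients per domain
# and identical learnability, so only the transfer term separates them.
grads = {"math": [1.0, 0.0], "code": [0.8, 0.6], "science": [0.0, 1.0]}
learn = {"math": 0.5, "code": 0.5, "science": 0.5}
probs = sampling_probs(learn, grads)
print(probs)
```

In this toy setup the "code" gradient is partly aligned with both other domains, so it receives the largest sampling probability. Setting `transfer_weight=0.0` reduces the sketch to a learnability-only rule, which here yields a uniform distribution.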
Similar Articles
What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA
This paper empirically studies how the composition of training data (curriculum) affects the skills learned by RL-based memory agents in multi-session question answering. It finds that curriculum composition acts as a fine-grained lever on specialization, with mixed benchmarks yielding the best overall performance and narrow out-of-domain sets transferring targeted temporal reasoning skills.
Multi-Turn Reasoning When Context Arrives in Pieces: Scalable Sharding and Memory-Augmented RL
This paper addresses the 'Lost in Conversation' problem where LLMs struggle with information revealed across multiple turns. It proposes a scalable sharding pipeline to create multi-turn training data from single-turn QA datasets and uses reinforcement learning with verifiable rewards to train a memory-augmented policy that maintains a compact rolling memory, improving multi-turn reasoning accuracy and generalizing zero-shot to harder tasks.
Stratagem: Learning Transferable Reasoning via Trajectory-Modulated Game Self-Play
STRATAGEM is a new framework for improving reasoning transferability in language models by using game self-play with a Reasoning Transferability Coefficient and Reasoning Evolution Reward to reinforce abstract, domain-agnostic reasoning patterns over game-specific heuristics. Experiments show strong improvements on mathematical reasoning, general reasoning, and code generation benchmarks.
Tandem Reinforcement Learning with Verifiable Rewards
Proposes Tandem Reinforcement Learning (TRL), extending the tandem training paradigm to RLVR to improve reasoning compatibility and legibility for weaker models and humans, showing that TRL matches solo performance while enhancing handoff robustness and reducing distributional drift.
GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero
GRLO introduces a novel reinforcement learning post-training method that achieves strong generalization across multiple domains (math, code, etc.) from only 5K prompts and 22.7 GPU hours, significantly outperforming in-domain RLVR baselines in efficiency and data requirements.