Learning to Reason with Curriculum II: Compositional Generalization

arXiv cs.LG Papers

Summary

This paper theoretically analyzes how curriculum learning, by decomposing complex problems into simpler sub-problems and composing solutions, can dramatically reduce the sample complexity of learning to simulate sequential computations (semiautomata) compared to direct methods, achieving subpolynomial supervision requirements in supervised fine-tuning and exponentially weaker coverage conditions in reinforcement learning with verifiable rewards.

arXiv:2606.27721v1 Announce Type: new Abstract: Compositional generalization, the ability to solve complex problems by combining solutions to simpler sub-problems, is a fundamental capability of both natural and artificial intelligence, and a key mechanism underlying chain-of-thought reasoning. However, the theoretical underpinnings of compositional generalization remain poorly understood: when and why does decomposing a problem into parts yield more efficient learning than solving it directly? We study this question through the canonical problem of learning to simulate semiautomata (predicting the outcome of $T$ steps of sequential computation), a model that captures state tracking, regular language recognition, and modular arithmetic. We show that an autocurriculum-based approach building on Part I of this series, recursively decomposing longer sequences into shorter sub-problems, learning to solve them, and composing the solutions, achieves dramatically better statistical complexity than direct methods. (i) For a setting inspired by supervised fine-tuning (SFT) where the learner receives interactive feedback on intermediate states of the computation, curriculum facilitates learning from only $2^{\mathcal{O}(\sqrt{\log T})}$ tokens of supervision; i.e., subpolynomial in the sequence length $T$, overcoming the $\Omega(T)$ token barrier required by direct simulation. (ii) For a setting inspired by reinforcement learning with verifiable rewards (RLVR), where the learner improves a pre-trained reference model using an outcome verifier, we show that curriculum reduces the requirement on the reference model from coverage at the full sequence length $T$ to coverage at a shorter block length $B \ll T$, an exponentially weaker condition.
Original Article
View Cached Full Text

Cached at: 06/29/26, 05:25 AM

# Learning to Reason with Curriculum II: Compositional Generalization
Source: [https://arxiv.org/abs/2606.27721](https://arxiv.org/abs/2606.27721)
[View PDF](https://arxiv.org/pdf/2606.27721)

> Abstract:Compositional generalization, the ability to solve complex problems by combining solutions to simpler sub\-problems, is a fundamental capability of both natural and artificial intelligence, and a key mechanism underlying chain\-of\-thought reasoning\. However, the theoretical underpinnings of compositional generalization remain poorly understood: when and why does decomposing a problem into parts yield more efficient learning than solving it directly? We study this question through the canonical problem of learning to simulate semiautomata \(predicting the outcome of $T$ steps of sequential computation\), a model that captures state tracking, regular language recognition, and modular arithmetic\. We show that an autocurriculum\-based approach building on Part I of this series, recursively decomposing longer sequences into shorter sub\-problems, learning to solve them, and composing the solutions, achieves dramatically better statistical complexity than direct methods\. \(i\) For a setting inspired by supervised fine\-tuning \(SFT\) where the learner receives interactive feedback on intermediate states of the computation, curriculum facilitates learning from only $2^\{\\mathcal\{O\}\(\\sqrt\{\\log T\}\)\}$ tokens of supervision; i\.e\., subpolynomial in the sequence length $T$, overcoming the $\\Omega\(T\)$ token barrier required by direct simulation\. \(ii\) For a setting inspired by reinforcement learning with verifiable rewards \(RLVR\), where the learner improves a pre\-trained reference model using an outcome verifier, we show that curriculum reduces the requirement on the reference model from coverage at the full sequence length $T$ to coverage at a shorter block length $B \\ll T$, an exponentially weaker condition\.

## Submission history

From: Nived Rajaraman \[[view email](https://arxiv.org/show-email/ba6853cb/2606.27721)\] **\[v1\]**Fri, 26 Jun 2026 05:09:08 UTC \(119 KB\)

Similar Articles

What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA

arXiv cs.CL

This paper empirically studies how the composition of training data (curriculum) affects the skills learned by RL-based memory agents in multi-session question answering. It finds that curriculum composition acts as a fine-grained lever on specialization, with mixed benchmarks yielding the best overall performance and narrow out-of-domain sets transferring targeted temporal reasoning skills.

Transferability for General Reasoning: An Automated Curriculum for Multi-Domain RLVR

Hugging Face Daily Papers

This paper proposes Transfer-Aware Curriculum (TAC), a bandit-style online curriculum for multi-domain RLVR that prioritizes domains whose updates benefit other domains using gradient-geometry alignment. TAC improves macro-averaged accuracy on Qwen3-1.7B and Llama3.2-3B over fixed and learnability-only curricula.

@askalphaxiv: A fascinating paper supervised by Yoshua Bengio "Generative Recursive Reasoning" Test time compute should scale not jus…

X AI KOLs Timeline

The paper 'Generative Recursive Reasoning' introduces a method that scales test-time compute by sampling multiple latent reasoning trajectories in parallel, enabling the model to explore diverse hypotheses and avoid deterministic collapse. This approach improves performance on tasks such as Sudoku, ARC AGI, N Queens, and graph coloring, and can also generate valid Sudoku boards and MNIST digits.

Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions

arXiv cs.AI

This paper evaluates three approaches (pure chain-of-thought reasoning, single-shot code execution, and iterative code execution) on 1,000 GSM-Symbolic problems using Claude Haiku 4.5, finding that chain-of-thought is the most robust to perturbation, while code execution does not improve reasoning robustness on grade-school math problems.