Learning to Reason with Curriculum II: Compositional Generalization

arXiv cs.LG 06/29/26, 04:00 AM Papers

Summary

This paper theoretically analyzes how curriculum learning, by decomposing complex problems into simpler sub-problems and composing solutions, can dramatically reduce the sample complexity of learning to simulate sequential computations (semiautomata) compared to direct methods, achieving subpolynomial supervision requirements in supervised fine-tuning and exponentially weaker coverage conditions in reinforcement learning with verifiable rewards.

arXiv:2606.27721v1 Announce Type: new Abstract: Compositional generalization, the ability to solve complex problems by combining solutions to simpler sub-problems, is a fundamental capability of both natural and artificial intelligence, and a key mechanism underlying chain-of-thought reasoning. However, the theoretical underpinnings of compositional generalization remain poorly understood: when and why does decomposing a problem into parts yield more efficient learning than solving it directly? We study this question through the canonical problem of learning to simulate semiautomata (predicting the outcome of $T$ steps of sequential computation), a model that captures state tracking, regular language recognition, and modular arithmetic. We show that an autocurriculum-based approach building on Part I of this series, recursively decomposing longer sequences into shorter sub-problems, learning to solve them, and composing the solutions, achieves dramatically better statistical complexity than direct methods. (i) For a setting inspired by supervised fine-tuning (SFT) where the learner receives interactive feedback on intermediate states of the computation, curriculum facilitates learning from only $2^{\mathcal{O}(\sqrt{\log T})}$ tokens of supervision; i.e., subpolynomial in the sequence length $T$, overcoming the $\Omega(T)$ token barrier required by direct simulation. (ii) For a setting inspired by reinforcement learning with verifiable rewards (RLVR), where the learner improves a pre-trained reference model using an outcome verifier, we show that curriculum reduces the requirement on the reference model from coverage at the full sequence length $T$ to coverage at a shorter block length $B \ll T$, an exponentially weaker condition.

Original Article

View Cached Full Text

Cached at: 06/29/26, 05:25 AM

# Learning to Reason with Curriculum II: Compositional Generalization
Source: [https://arxiv.org/abs/2606.27721](https://arxiv.org/abs/2606.27721)
[View PDF](https://arxiv.org/pdf/2606.27721)

> Abstract:Compositional generalization, the ability to solve complex problems by combining solutions to simpler sub\-problems, is a fundamental capability of both natural and artificial intelligence, and a key mechanism underlying chain\-of\-thought reasoning\. However, the theoretical underpinnings of compositional generalization remain poorly understood: when and why does decomposing a problem into parts yield more efficient learning than solving it directly? We study this question through the canonical problem of learning to simulate semiautomata \(predicting the outcome of $T$ steps of sequential computation\), a model that captures state tracking, regular language recognition, and modular arithmetic\. We show that an autocurriculum\-based approach building on Part I of this series, recursively decomposing longer sequences into shorter sub\-problems, learning to solve them, and composing the solutions, achieves dramatically better statistical complexity than direct methods\. \(i\) For a setting inspired by supervised fine\-tuning \(SFT\) where the learner receives interactive feedback on intermediate states of the computation, curriculum facilitates learning from only $2^\{\\mathcal\{O\}\(\\sqrt\{\\log T\}\)\}$ tokens of supervision; i\.e\., subpolynomial in the sequence length $T$, overcoming the $\\Omega\(T\)$ token barrier required by direct simulation\. \(ii\) For a setting inspired by reinforcement learning with verifiable rewards \(RLVR\), where the learner improves a pre\-trained reference model using an outcome verifier, we show that curriculum reduces the requirement on the reference model from coverage at the full sequence length $T$ to coverage at a shorter block length $B \\ll T$, an exponentially weaker condition\.

## Submission history

From: Nived Rajaraman \[[view email](https://arxiv.org/show-email/ba6853cb/2606.27721)\] **\[v1\]**Fri, 26 Jun 2026 05:09:08 UTC \(119 KB\)

Learning to Reason with Curriculum II: Compositional Generalization

Similar Articles

Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization

What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA

Transferability for General Reasoning: An Automated Curriculum for Multi-Domain RLVR

@askalphaxiv: A fascinating paper supervised by Yoshua Bengio "Generative Recursive Reasoning" Test time compute should scale not jus…

Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions

Submit Feedback

Similar Articles

Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization

What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA

Transferability for General Reasoning: An Automated Curriculum for Multi-Domain RLVR

@askalphaxiv: A fascinating paper supervised by Yoshua Bengio "Generative Recursive Reasoning" Test time compute should scale not jus…

Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions