Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
Summary
This paper introduces ScaleLogic, a framework demonstrating that RL training compute scales as a power law with reasoning depth in LLMs. It highlights that logical expressiveness is key to improving downstream transfer and training efficiency.
View Cached Full Text
Cached at: 05/08/26, 07:32 AM
Paper page - Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
Source: https://huggingface.co/papers/2605.06638
Abstract
ScaleLogic demonstrates that reinforcement learning training compute scales as a power law with reasoning depth, with scaling exponents increasing monotonically with logical expressiveness across multiple reasoning tasks.
Reinforcement learning(RL) has been applied to improve large language model (LLM) reasoning, yet the systematic study of how training scales with task difficulty has been hampered by the lack of controlled, scalable environments. We introduce ScaleLogic, a syntheticlogical reasoningframework that offers independent control over two axes of difficulty: the depth of the requiredproof planning(i.e., thehorizon) and the expressiveness of the underlying logic. Our proposed framework supports a wide range of logics: from simple implication-only logic (“if-then”) towards more expressivefirst-order reasoningwith conjunction (“and”), disjunction (“or”), negation (“not”), and universal quantification (“for all”). Using this framework, we show that the RL training compute T follows a power law with respect to reasoning depth D (T propto D^γ, R^{2} > 0.99), and that thescaling exponentγ increases monotonically with logical expressiveness, from 1.04 to 2.60. On downstream mathematics and general reasoning benchmarks, more expressive training settings yield both larger performance gains (up to +10.66 points) and more compute-efficient transfer compared to less expressive settings, demonstrating that what a model is trained on, not just how much it is trained, shapes downstream transfer. We further show that the power-law relationship holds across multiple RL methods, andcurriculum-based trainingsubstantially improves scaling efficiency.
View arXiv pageView PDFAdd to collection
Get this paper in your agent:
hf papers read 2605\.06638
Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash
Models citing this paper0
No model linking this paper
Cite arxiv.org/abs/2605.06638 in a model README.md to link it from this page.
Datasets citing this paper0
No dataset linking this paper
Cite arxiv.org/abs/2605.06638 in a dataset README.md to link it from this page.
Spaces citing this paper0
No Space linking this paper
Cite arxiv.org/abs/2605.06638 in a Space README.md to link it from this page.
Collections including this paper0
No Collection including this paper
Add this paper to acollectionto link it from this page.
Similar Articles
Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning
This paper challenges the assumption that RL teaches new reasoning capabilities to LLMs, arguing instead that it performs sparse policy selection at high-entropy decision points. It introduces ReasonMaxxer, an RL-free method that matches full RL performance with significantly lower training costs.
When Do LLMs Reason? A Dynamical Systems View via Entropy Phase Transitions
This paper investigates when chain-of-thought reasoning is beneficial for LLMs, showing that early-stage entropy dynamics reliably indicate reasoning utility, and introduces EDRM, a lightweight, training-free framework that adaptively selects inference strategies to achieve significant token savings while maintaining or improving accuracy.
Learning to reason with LLMs
OpenAI publishes an article exploring reasoning techniques with LLMs through cipher-decoding examples, demonstrating step-by-step problem-solving approaches and pattern recognition in language models.
How Well Do LLMs Perform on the Simplest Long-Chain Reasoning Tasks: An Empirical Study on the Equivalence Class Problem
This empirical study evaluates LLMs on the Equivalence Class Problem to assess long-chain reasoning capabilities, finding that non-reasoning models fail while reasoning models struggle with specific structural difficulties.
When Can LLMs Learn to Reason with Weak Supervision?
This paper systematically studies when LLMs can generalize in reasoning tasks under weak supervision (scarce data, noisy rewards, self-supervised proxy rewards), finding that reward saturation dynamics and reasoning faithfulness are key predictors, and that SFT on explicit reasoning traces is necessary for successful generalization under weak supervision.