Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

Hugging Face Daily Papers Papers

Summary

This paper introduces ScaleLogic, a framework demonstrating that RL training compute scales as a power law with reasoning depth in LLMs. It highlights that logical expressiveness is key to improving downstream transfer and training efficiency.

Reinforcement learning (RL) has been applied to improve large language model (LLM) reasoning, yet the systematic study of how training scales with task difficulty has been hampered by the lack of controlled, scalable environments. We introduce ScaleLogic, a synthetic logical reasoning framework that offers independent control over two axes of difficulty: the depth of the required proof planning (i.e., the horizon) and the expressiveness of the underlying logic. Our proposed framework supports a wide range of logics: from simple implication-only logic ("if-then") towards more expressive first-order reasoning with conjunction ("and"), disjunction ("or"), negation ("not"), and universal quantification ("for all"). Using this framework, we show that the RL training compute T follows a power law with respect to reasoning depth D (T propto D^γ, R^{2} > 0.99), and that the scaling exponent γ increases monotonically with logical expressiveness, from 1.04 to 2.60. On downstream mathematics and general reasoning benchmarks, more expressive training settings yield both larger performance gains (up to +10.66 points) and more compute-efficient transfer compared to less expressive settings, demonstrating that what a model is trained on, not just how much it is trained, shapes downstream transfer. We further show that the power-law relationship holds across multiple RL methods, and curriculum-based training substantially improves scaling efficiency.
Original Article
View Cached Full Text

Cached at: 05/08/26, 07:32 AM

Paper page - Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

Source: https://huggingface.co/papers/2605.06638

Abstract

ScaleLogic demonstrates that reinforcement learning training compute scales as a power law with reasoning depth, with scaling exponents increasing monotonically with logical expressiveness across multiple reasoning tasks.

Reinforcement learning(RL) has been applied to improve large language model (LLM) reasoning, yet the systematic study of how training scales with task difficulty has been hampered by the lack of controlled, scalable environments. We introduce ScaleLogic, a syntheticlogical reasoningframework that offers independent control over two axes of difficulty: the depth of the requiredproof planning(i.e., thehorizon) and the expressiveness of the underlying logic. Our proposed framework supports a wide range of logics: from simple implication-only logic (“if-then”) towards more expressivefirst-order reasoningwith conjunction (“and”), disjunction (“or”), negation (“not”), and universal quantification (“for all”). Using this framework, we show that the RL training compute T follows a power law with respect to reasoning depth D (T propto D^γ, R^{2} > 0.99), and that thescaling exponentγ increases monotonically with logical expressiveness, from 1.04 to 2.60. On downstream mathematics and general reasoning benchmarks, more expressive training settings yield both larger performance gains (up to +10.66 points) and more compute-efficient transfer compared to less expressive settings, demonstrating that what a model is trained on, not just how much it is trained, shapes downstream transfer. We further show that the power-law relationship holds across multiple RL methods, andcurriculum-based trainingsubstantially improves scaling efficiency.

View arXiv pageView PDFAdd to collection

Get this paper in your agent:

hf papers read 2605\.06638

Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash

Models citing this paper0

No model linking this paper

Cite arxiv.org/abs/2605.06638 in a model README.md to link it from this page.

Datasets citing this paper0

No dataset linking this paper

Cite arxiv.org/abs/2605.06638 in a dataset README.md to link it from this page.

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2605.06638 in a Space README.md to link it from this page.

Collections including this paper0

No Collection including this paper

Add this paper to acollectionto link it from this page.

Similar Articles

When Do LLMs Reason? A Dynamical Systems View via Entropy Phase Transitions

arXiv cs.LG

This paper investigates when chain-of-thought reasoning is beneficial for LLMs, showing that early-stage entropy dynamics reliably indicate reasoning utility, and introduces EDRM, a lightweight, training-free framework that adaptively selects inference strategies to achieve significant token savings while maintaining or improving accuracy.

Learning to reason with LLMs

OpenAI Blog

OpenAI publishes an article exploring reasoning techniques with LLMs through cipher-decoding examples, demonstrating step-by-step problem-solving approaches and pattern recognition in language models.

When Can LLMs Learn to Reason with Weak Supervision?

Hugging Face Daily Papers

This paper systematically studies when LLMs can generalize in reasoning tasks under weak supervision (scarce data, noisy rewards, self-supervised proxy rewards), finding that reward saturation dynamics and reasoning faithfulness are key predictors, and that SFT on explicit reasoning traces is necessary for successful generalization under weak supervision.