When Can LLMs Learn to Reason with Weak Supervision?
Summary
This paper systematically studies when LLMs can generalize in reasoning tasks under weak supervision (scarce data, noisy rewards, self-supervised proxy rewards), finding that reward saturation dynamics and reasoning faithfulness are key predictors, and that SFT on explicit reasoning traces is necessary for successful generalization under weak supervision.
Cached at: 04/21/26, 07:20 AM
Source: https://huggingface.co/papers/2604.18574
Abstract
Research reveals that model generalization in reasoning tasks under weak supervision depends on reward saturation dynamics and reasoning faithfulness, with supervised fine-tuning on explicit traces being crucial for successful adaptation.
Large language models have achieved significant reasoning improvements through reinforcement learning with verifiable rewards (RLVR). Yet as model capabilities grow, constructing high-quality reward signals becomes increasingly difficult, making it essential to understand when RLVR can succeed under weaker forms of supervision. We conduct a systematic empirical study across diverse model families and reasoning domains under three weak supervision settings: scarce data, noisy rewards, and self-supervised proxy rewards. We find that generalization is governed by training reward saturation dynamics: models that generalize exhibit a prolonged pre-saturation phase during which training reward and downstream performance climb together, while models that saturate rapidly memorize rather than learn. We identify reasoning faithfulness, defined as the extent to which intermediate steps logically support the final answer, as the pre-RL property that predicts which regime a model falls into, while output diversity alone is uninformative. Motivated by these findings, we disentangle the contributions of continual pre-training and supervised fine-tuning, finding that SFT on explicit reasoning traces is necessary for generalization under weak supervision, while continual pre-training on domain data amplifies the effect. Applied together to Llama3.2-3B-Base, these interventions enable generalization across all three settings where the base model previously failed.
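The saturation dynamics described in the abstract can be made concrete with a small sketch. The abstract does not say how the authors measure saturation, so everything here is an assumption for illustration: the function name `saturation_step`, the 0.95 threshold, the trailing window of 3 steps, and the reward curves are all hypothetical, and rewards are assumed to lie in [0, 1].

```python
from typing import Optional, Sequence


def saturation_step(
    rewards: Sequence[float],
    threshold: float = 0.95,  # assumed cutoff, not taken from the paper
    window: int = 3,          # assumed smoothing window, not taken from the paper
) -> Optional[int]:
    """Return the first step whose trailing-window mean reward reaches
    `threshold`, or None if the curve never saturates."""
    if window < 1:
        raise ValueError("window must be at least 1")
    for end in range(window, len(rewards) + 1):
        mean = sum(rewards[end - window:end]) / window
        if mean >= threshold:
            return end - 1  # index of the last step in the window
    return None


# Hypothetical per-step mean training rewards for two runs.
fast_run = [0.3, 0.9, 1.0, 1.0, 1.0, 1.0]  # saturates almost immediately
slow_run = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]  # still in a pre-saturation phase

print(saturation_step(fast_run))
print(saturation_step(slow_run))
```

Under these assumptions, an early saturation step would flag the rapid-saturation regime that the abstract associates with memorization, while a late or absent one would correspond to the prolonged pre-saturation phase in which training reward and downstream performance can still rise together.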
Similar Articles
How Well Do LLMs Perform on the Simplest Long-Chain Reasoning Tasks: An Empirical Study on the Equivalence Class Problem
This empirical study evaluates LLMs on the Equivalence Class Problem to assess long-chain reasoning capabilities, finding that non-reasoning models fail while reasoning models struggle with specific structural difficulties.
When Do LLMs Reason? A Dynamical Systems View via Entropy Phase Transitions
This paper investigates when chain-of-thought reasoning is beneficial for LLMs, showing that early-stage entropy dynamics reliably indicate reasoning utility, and introduces EDRM, a lightweight, training-free framework that adaptively selects inference strategies to achieve significant token savings while maintaining or improving accuracy.
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
This paper introduces ScaleLogic, a framework demonstrating that RL training compute scales as a power law with reasoning depth in LLMs. It highlights that logical expressiveness is key to improving downstream transfer and training efficiency.
Learning to reason with LLMs
OpenAI publishes an article exploring reasoning techniques with LLMs through cipher-decoding examples, demonstrating step-by-step problem-solving approaches and pattern recognition in language models.
@burny_tech: A Survey on Latent Reasoning "Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, especia…
This survey provides a comprehensive overview of latent reasoning in LLMs, exploring methods that perform multi-step inference in continuous hidden states without explicit token-level supervision.