When Can LLMs Learn to Reason with Weak Supervision?

Hugging Face Daily Papers Papers

Summary

This paper systematically studies when LLMs can generalize in reasoning tasks under weak supervision (scarce data, noisy rewards, self-supervised proxy rewards), finding that reward saturation dynamics and reasoning faithfulness are key predictors, and that SFT on explicit reasoning traces is necessary for successful generalization under weak supervision.

Large language models have achieved significant reasoning improvements through reinforcement learning with verifiable rewards (RLVR). Yet as model capabilities grow, constructing high-quality reward signals becomes increasingly difficult, making it essential to understand when RLVR can succeed under weaker forms of supervision. We conduct a systematic empirical study across diverse model families and reasoning domains under three weak supervision settings: scarce data, noisy rewards, and self-supervised proxy rewards. We find that generalization is governed by training reward saturation dynamics: models that generalize exhibit a prolonged pre-saturation phase during which training reward and downstream performance climb together, while models that saturate rapidly memorize rather than learn. We identify reasoning faithfulness, defined as the extent to which intermediate steps logically support the final answer, as the pre-RL property that predicts which regime a model falls into, while output diversity alone is uninformative. Motivated by these findings, we disentangle the contributions of continual pre-training and supervised fine-tuning, finding that SFT on explicit reasoning traces is necessary for generalization under weak supervision, while continual pre-training on domain data amplifies the effect. Applied together to Llama3.2-3B-Base, these interventions enable generalization across all three settings where the base model previously failed.
Original Article
View Cached Full Text

Cached at: 04/21/26, 07:20 AM

Paper page - When Can LLMs Learn to Reason with Weak Supervision?

Source: https://huggingface.co/papers/2604.18574

Abstract

Research reveals that model generalization in reasoning tasks under weak supervision depends on reward saturation dynamics and reasoning faithfulness, with supervised fine-tuning on explicit traces being crucial for successful adaptation.

Large language models have achieved significant reasoning improvements throughreinforcement learning with verifiable rewards(RLVR). Yet as model capabilities grow, constructing high-qualityreward signalsbecomes increasingly difficult, making it essential to understand when RLVR can succeed under weaker forms of supervision. We conduct a systematic empirical study across diverse model families and reasoning domains under threeweak supervisionsettings: scarce data, noisy rewards, and self-supervised proxy rewards. We find that generalization is governed by trainingreward saturation dynamics: models that generalize exhibit a prolonged pre-saturation phase during which training reward and downstream performance climb together, while models that saturate rapidly memorize rather than learn. We identifyreasoning faithfulness, defined as the extent to which intermediate steps logically support the final answer, as the pre-RL property that predicts which regime a model falls into, while output diversity alone is uninformative. Motivated by these findings, we disentangle the contributions ofcontinual pre-trainingandsupervised fine-tuning, finding that SFT onexplicit reasoning tracesis necessary for generalization underweak supervision, whilecontinual pre-trainingon domain data amplifies the effect. Applied together to Llama3.2-3B-Base, these interventions enable generalization across all three settings where the base model previously failed.

View arXiv pageView PDFProject pageGitHub1Add to collection

Get this paper in your agent:

hf papers read 2604\.18574

Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash

Models citing this paper0

No model linking this paper

Cite arxiv.org/abs/2604.18574 in a model README.md to link it from this page.

Datasets citing this paper0

No dataset linking this paper

Cite arxiv.org/abs/2604.18574 in a dataset README.md to link it from this page.

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2604.18574 in a Space README.md to link it from this page.

Collections including this paper0

No Collection including this paper

Add this paper to acollectionto link it from this page.

Similar Articles

When Do LLMs Reason? A Dynamical Systems View via Entropy Phase Transitions

arXiv cs.LG

This paper investigates when chain-of-thought reasoning is beneficial for LLMs, showing that early-stage entropy dynamics reliably indicate reasoning utility, and introduces EDRM, a lightweight, training-free framework that adaptively selects inference strategies to achieve significant token savings while maintaining or improving accuracy.

Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

Hugging Face Daily Papers

This paper introduces ScaleLogic, a framework demonstrating that RL training compute scales as a power law with reasoning depth in LLMs. It highlights that logical expressiveness is key to improving downstream transfer and training efficiency.

Learning to reason with LLMs

OpenAI Blog

OpenAI publishes an article exploring reasoning techniques with LLMs through cipher-decoding examples, demonstrating step-by-step problem-solving approaches and pattern recognition in language models.