When Can LLMs Learn to Reason with Weak Supervision?

Hugging Face Daily Papers 04/20/26, 12:00 AM Papers

llm reinforcement-learning weak-supervision reasoning fine-tuning rlvr reward-modeling

Summary

This paper systematically studies when LLMs can generalize in reasoning tasks under weak supervision (scarce data, noisy rewards, self-supervised proxy rewards), finding that reward saturation dynamics and reasoning faithfulness are key predictors, and that SFT on explicit reasoning traces is necessary for successful generalization under weak supervision.

Large language models have achieved significant reasoning improvements through reinforcement learning with verifiable rewards (RLVR). Yet as model capabilities grow, constructing high-quality reward signals becomes increasingly difficult, making it essential to understand when RLVR can succeed under weaker forms of supervision. We conduct a systematic empirical study across diverse model families and reasoning domains under three weak supervision settings: scarce data, noisy rewards, and self-supervised proxy rewards. We find that generalization is governed by training reward saturation dynamics: models that generalize exhibit a prolonged pre-saturation phase during which training reward and downstream performance climb together, while models that saturate rapidly memorize rather than learn. We identify reasoning faithfulness, defined as the extent to which intermediate steps logically support the final answer, as the pre-RL property that predicts which regime a model falls into, while output diversity alone is uninformative. Motivated by these findings, we disentangle the contributions of continual pre-training and supervised fine-tuning, finding that SFT on explicit reasoning traces is necessary for generalization under weak supervision, while continual pre-training on domain data amplifies the effect. Applied together to Llama3.2-3B-Base, these interventions enable generalization across all three settings where the base model previously failed.
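To make the three weak supervision settings concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the function names, the exact-match verifier, the symmetric reward flipping used for noise, and the majority-vote proxy reward are all hypothetical choices that the abstract does not specify.

```python
import random
from collections import Counter


def exact_match_reward(answer: str, reference: str) -> float:
    """Verifiable reward: 1.0 if the final answer matches the reference, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def subsample(dataset: list, n: int, rng: random.Random) -> list:
    """Scarce data (assumed form): keep only n randomly chosen training examples."""
    return rng.sample(dataset, min(n, len(dataset)))


def noisy_reward(answer: str, reference: str, flip_prob: float,
                 rng: random.Random) -> float:
    """Noisy reward (assumed form): flip the verifiable reward with probability flip_prob."""
    reward = exact_match_reward(answer, reference)
    return 1.0 - reward if rng.random() < flip_prob else reward


def majority_vote_rewards(answers: list[str]) -> list[float]:
    """Self-supervised proxy (assumed form): reward samples that agree with the
    most common answer among the model's own samples; no reference is used."""
    cleaned = [a.strip() for a in answers]
    majority, _ = Counter(cleaned).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in cleaned]
```

In this sketch, each setting weakens a different part of the pipeline: `subsample` limits how many prompts are seen, `noisy_reward` corrupts the verifier's signal, and `majority_vote_rewards` removes the reference answer altogether.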

Cached at: 04/21/26, 07:20 AM

Paper page - When Can LLMs Learn to Reason with Weak Supervision?

Source: https://huggingface.co/papers/2604.18574

Abstract

Research reveals that model generalization in reasoning tasks under weak supervision depends on reward saturation dynamics and reasoning faithfulness, with supervised fine-tuning on explicit traces being crucial for successful adaptation.

Large language models have achieved significant reasoning improvements through reinforcement learning with verifiable rewards (RLVR). Yet as model capabilities grow, constructing high-quality reward signals becomes increasingly difficult, making it essential to understand when RLVR can succeed under weaker forms of supervision. We conduct a systematic empirical study across diverse model families and reasoning domains under three weak supervision settings: scarce data, noisy rewards, and self-supervised proxy rewards. We find that generalization is governed by training reward saturation dynamics: models that generalize exhibit a prolonged pre-saturation phase during which training reward and downstream performance climb together, while models that saturate rapidly memorize rather than learn. We identify reasoning faithfulness, defined as the extent to which intermediate steps logically support the final answer, as the pre-RL property that predicts which regime a model falls into, while output diversity alone is uninformative. Motivated by these findings, we disentangle the contributions of continual pre-training and supervised fine-tuning, finding that SFT on explicit reasoning traces is necessary for generalization under weak supervision, while continual pre-training on domain data amplifies the effect. Applied together to Llama3.2-3B-Base, these interventions enable generalization across all three settings where the base model previously failed.

Get this paper in your agent:

hf papers read 2604.18574

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash


Similar Articles

How Well Do LLMs Perform on the Simplest Long-Chain Reasoning Tasks: An Empirical Study on the Equivalence Class Problem

When Do LLMs Reason? A Dynamical Systems View via Entropy Phase Transitions

Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

Learning to reason with LLMs

@burny_tech: A Survey on Latent Reasoning "Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, especia…
