ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning
Summary
This paper introduces ResRL, a method that boosts LLM reasoning by using negative sample projection to decouple the semantic distributions shared by positive and negative responses. It aims to preserve generation diversity while improving performance across twelve benchmarks spanning Mathematics, Code, Agent Tasks, and Function Calling.
Cached at: 05/08/26, 08:04 AM
Source: https://huggingface.co/papers/2605.00380
Published on May 1
Submitted by zihan (https://huggingface.co/lin1111987) on May 7
Abstract
ResRL improves LLM reasoning by decoupling semantic distributions between positive and negative responses through negative sample projection, maintaining diversity while outperforming existing methods on multiple benchmarks.
Reinforcement Learning with Verifiable Rewards (RLVR) enhances the reasoning of Large Language Models (LLMs) but usually exhibits limited generation diversity due to the over-incentivization of positive rewards. Although methods like Negative Sample Reinforcement (NSR) mitigate this issue by upweighting the penalty from negative samples, they may suppress the semantic distributions shared between positive and negative responses. To boost reasoning ability without losing diversity, this paper proposes negative sample projection Residual Reinforcement Learning (ResRL), which decouples similar semantic distributions among positive and negative responses. We theoretically link Lazy Likelihood Displacement (LLD) to negative-positive head-gradient interference and derive a single-forward proxy that upper-bounds representation alignment to guide conservative advantage reweighting. ResRL then projects negative-token hidden representations onto an SVD-based low-rank positive subspace and uses projection residuals to modulate negative gradients, improving reasoning while preserving diversity and outperforming strong baselines on average across twelve benchmarks spanning Mathematics, Code, Agent Tasks, and Function Calling. Notably, ResRL surpasses NSR on mathematical reasoning by 9.4% in Avg@16 and 7.0% in Pass@128. Code is available at https://github.com/1229095296/ResRL.git.
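The projection-residual idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `positive_subspace` and `residual_weights`, the use of an uncentered SVD, the chosen rank, and the normalization of the residual into a weight between 0 and 1 are all assumptions made for illustration. The sketch builds a low-rank basis from positive-token hidden states, projects negative-token hidden states onto it, and uses the relative size of what is left over as a per-token weight on the negative advantage.

```python
import numpy as np

def positive_subspace(pos_hidden, rank):
    """Return an orthonormal basis (d, rank) for the top-`rank` directions
    of the positive-token hidden states, shape (n_pos, d).
    Assumption: a plain, uncentered SVD is used to define the subspace."""
    _, _, vt = np.linalg.svd(pos_hidden, full_matrices=False)
    return vt[:rank].T

def residual_weights(neg_hidden, basis, eps=1e-8):
    """Per-token weight: norm of the projection residual divided by the
    norm of the hidden state. Near 0 when a negative token lies inside the
    positive subspace, near 1 when it is orthogonal to it.
    Assumption: this ratio is one hypothetical way to modulate gradients."""
    projection = neg_hidden @ basis @ basis.T
    residual = neg_hidden - projection
    return np.linalg.norm(residual, axis=1) / (np.linalg.norm(neg_hidden, axis=1) + eps)

# Toy example: positive hidden states span the first two axes of a 4-d space.
pos_hidden = np.array([[1.0, 0.0, 0.0, 0.0],
                       [0.0, 2.0, 0.0, 0.0],
                       [1.0, 1.0, 0.0, 0.0]])
neg_hidden = np.array([[1.0, 0.0, 0.0, 0.0],   # fully shared with positives
                       [0.0, 0.0, 1.0, 0.0],   # fully distinct from positives
                       [1.0, 0.0, 1.0, 0.0]])  # partially shared

basis = positive_subspace(pos_hidden, rank=2)
weights = residual_weights(neg_hidden, basis)

# Hypothetical reweighting: scale a uniform negative advantage of -1 per token.
scaled_advantages = -1.0 * weights
print(weights)
print(scaled_advantages)
```

Under these assumptions, a negative token whose representation is shared with positive responses receives almost no penalty, while a token that is distinct from the positive subspace keeps its full negative advantage. This mirrors the stated goal of not suppressing the semantics that positive and negative responses have in common.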
Similar Articles
ExpRL: Exploratory RL for LLM Mid-Training
ExpRL is a new RL-based mid-training method that uses human-written reference solutions as dense reward scaffolds (never shown to the policy) to improve LLM reasoning, achieving significant gains on hard math benchmarks like AIME-2026.
Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning
This paper challenges the assumption that RL teaches new reasoning capabilities to LLMs, arguing instead that it performs sparse policy selection at high-entropy decision points. It introduces ReasonMaxxer, an RL-free method that matches full RL performance with significantly lower training costs.
Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs
This paper investigates whether reinforcement learning can improve the direct recall of parametric knowledge in LLMs beyond reasoning tasks. It demonstrates that RL with binary rewards yields significant gains in factual QA benchmarks by redistributing probability mass to unlock latent knowledge rather than acquiring new facts.
Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR
This paper introduces RLRT, a method that reverses teacher signals in self-distillation to reinforce successful student deviations, enhancing reasoning exploration in large language models.
Learning to Refine Hidden States for Reliable LLM Reasoning
Proposes ReLAR, a reinforcement-guided latent refinement framework that iteratively updates hidden representations in LLMs before decoding, improving reasoning reliability and efficiency compared to chain-of-thought methods.