ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning


Summary

This paper introduces ResRL, a method to boost LLM reasoning by decoupling semantic distributions between positive and negative responses through negative sample projection. It aims to maintain generation diversity while improving performance on various benchmarks.

Reinforcement Learning with Verifiable Rewards (RLVR) enhances reasoning of Large Language Models (LLMs) but usually exhibits limited generation diversity due to the over-incentivization of positive rewards. Although methods like Negative Sample Reinforcement (NSR) mitigate this issue by upweighting penalty from negative samples, they may suppress the semantic distributions shared between positive and negative responses. To boost reasoning ability without losing diversity, this paper proposes negative sample projection Residual Reinforcement Learning (ResRL) that decouples similar semantic distributions among positive and negative responses. We theoretically link Lazy Likelihood Displacement (LLD) to negative-positive head-gradient interference and derive a single-forward proxy that upper-bounds representation alignment to guide conservative advantage reweighting. ResRL then projects negative-token hidden representations onto an SVD-based low-rank positive subspace and uses projection residuals to modulate negative gradients, improving reasoning while preserving diversity and outperforming strong baselines on average across twelve benchmarks spanning Mathematics, Code, Agent Tasks, and Function Calling. Notably, ResRL surpasses NSR on mathematical reasoning by 9.4% in Avg@16 and 7.0% in Pass@128. Code is available at https://github.com/1229095296/ResRL.git.
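
The projection step described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`positive_subspace`, `residual_weights`, `modulate_negative_advantages`), the use of a normalized residual norm as the weight, and the multiplicative reweighting of negative advantages are all assumptions made for illustration. The abstract only states that negative-token hidden representations are projected onto an SVD-based low-rank positive subspace and that the projection residuals modulate negative gradients.

```python
import numpy as np


def positive_subspace(pos_hidden: np.ndarray, rank: int) -> np.ndarray:
    """Orthonormal basis (d, rank) of the top-`rank` right singular
    directions of the positive-token hidden states (n_pos, d)."""
    _, _, vt = np.linalg.svd(pos_hidden, full_matrices=False)
    return vt[:rank].T


def residual_weights(neg_hidden: np.ndarray, basis: np.ndarray,
                     eps: float = 1e-8) -> np.ndarray:
    """Per-token weight in [0, 1]: the share of each negative hidden state
    (n_neg, d) that lies outside the positive subspace."""
    projection = neg_hidden @ basis @ basis.T   # component inside the subspace
    residual = neg_hidden - projection          # component outside the subspace
    return np.linalg.norm(residual, axis=1) / (
        np.linalg.norm(neg_hidden, axis=1) + eps
    )


def modulate_negative_advantages(neg_adv: np.ndarray,
                                 weights: np.ndarray) -> np.ndarray:
    """Assumed reweighting rule: scale each negative advantage by its
    residual weight, so tokens that look like positive ones are penalized less."""
    return neg_adv * weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pos = rng.normal(size=(32, 16))   # hypothetical positive-token hidden states
    neg = rng.normal(size=(8, 16))    # hypothetical negative-token hidden states
    basis = positive_subspace(pos, rank=4)
    w = residual_weights(neg, basis)
    print(modulate_negative_advantages(-np.ones(8), w))
```

Under these assumptions, a negative token whose hidden state lies entirely inside the positive subspace gets weight 0 and contributes no penalty, while one orthogonal to the subspace keeps its full penalty. That matches the stated goal of not suppressing semantics shared between positive and negative responses.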

Cached at: 05/08/26, 08:04 AM


Source: https://huggingface.co/papers/2605.00380 Published on May 1

Submitted by zihan (https://huggingface.co/lin1111987) on May 7

Abstract

ResRL improves LLM reasoning by decoupling semantic distributions between positive and negative responses through negative sample projection, maintaining diversity while outperforming existing methods on multiple benchmarks.





Similar Articles

ExpRL: Exploratory RL for LLM Mid-Training

Hugging Face Daily Papers

ExpRL is a new RL-based mid-training method that uses human-written reference solutions as dense reward scaffolds (never shown to the policy) to improve LLM reasoning, achieving significant gains on hard math benchmarks like AIME-2026.

Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs

arXiv cs.CL

This paper investigates whether reinforcement learning can improve the direct recall of parametric knowledge in LLMs beyond reasoning tasks. It demonstrates that RL with binary rewards yields significant gains in factual QA benchmarks by redistributing probability mass to unlock latent knowledge rather than acquiring new facts.

Learning to Refine Hidden States for Reliable LLM Reasoning

arXiv cs.LG

Proposes ReLAR, a reinforcement-guided latent refinement framework that iteratively updates hidden representations in LLMs before decoding, improving reasoning reliability and efficiency compared to chain-of-thought methods.