LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards

Hugging Face Daily Papers 05/29/26, 12:00 AM Papers

Summary

LongTraceRL introduces tiered distractor construction and rubric reward design to improve long-context reasoning in language models using reinforcement learning. The method generates multi-hop questions via knowledge graph random walks and uses search agent trajectories to build challenging distractors, with a rubric reward providing entity-level process supervision.
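As a rough illustration of the question-generation step, the sketch below samples a multi-hop path from a toy knowledge graph by random walk. The graph contents, the function name `random_walk_path`, and the no-revisit rule are assumptions made for illustration; the summary does not specify the paper's graph source, walk policy, or how paths are verbalized into questions.

```python
import random

def random_walk_path(graph, start, hops, rng):
    """Walk `hops` edges from `start` without revisiting an entity.

    `graph` maps an entity to a list of (relation, tail) pairs.
    Returns a list of (head, relation, tail) triples, or None if the walk gets stuck.
    """
    path = []
    visited = {start}
    current = start
    for _ in range(hops):
        candidates = [(r, t) for r, t in graph.get(current, []) if t not in visited]
        if not candidates:
            return None  # dead end: the caller can resample a start entity
        relation, tail = rng.choice(candidates)
        path.append((current, relation, tail))
        visited.add(tail)
        current = tail
    return path

# Toy graph (illustrative only).
toy_graph = {
    "Marie Curie": [("spouse", "Pierre Curie"), ("born_in", "Warsaw")],
    "Pierre Curie": [("born_in", "Paris")],
    "Warsaw": [("capital_of", "Poland")],
    "Paris": [("capital_of", "France")],
}

path = random_walk_path(toy_graph, "Marie Curie", hops=2, rng=random.Random(0))
# Entities visited along the chain; the last one plays the role of the answer.
gold_entities = [tail for _, _, tail in path]
```

Each triple on the path corresponds to one reasoning hop, so the intermediate entities are available later as candidates for process-level supervision.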

Long-context reasoning remains a central challenge for large language models, which often fail to locate and integrate key information in extensive distracting content. Reinforcement learning with verifiable rewards (RLVR) has shown promise for this task, yet existing methods are limited by low-confusability distractors and sparse, outcome-only reward signals that cannot supervise intermediate reasoning steps. To address these issues, we introduce LongTraceRL. For data construction, we generate multi-hop questions via knowledge graph random walks and leverage search agent trajectories to build tiered distractors: documents the agent read but did not cite (high confusability) and documents that appeared in search results but were never opened (low confusability), producing training contexts that are far more challenging than those built by random sampling or one-shot search. For reward design, we propose a rubric reward that uses the gold entities along each reasoning chain as fine-grained, entity-level process supervision. This rubric reward is applied only to responses with correct final answers (positive-only strategy), distinguishing the reasoning quality among correct responses and preventing reward hacking. Experiments on three reasoning LLMs (4B–30B) across five long-context benchmarks demonstrate that LongTraceRL consistently outperforms strong baselines and encourages comprehensive, evidence-grounded reasoning. Code, datasets, and models are available at https://github.com/THU-KEG/LongTraceRL.
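The positive-only rubric reward can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's formula: the exact-match answer check, the substring test for entity mentions, the additive combination, and the `rubric_weight` parameter are all hypothetical choices.

```python
def rubric_reward(response, predicted_answer, gold_answer, gold_entities,
                  rubric_weight=0.5):
    """Outcome reward plus an entity-coverage bonus, granted only when the answer is correct.

    Assumed form: 0.0 for a wrong answer, else 1.0 + rubric_weight * coverage,
    where coverage is the fraction of gold entities mentioned in the response.
    """
    if predicted_answer.strip().lower() != gold_answer.strip().lower():
        # Positive-only: wrong answers earn nothing, even if they name gold entities.
        return 0.0
    text = response.lower()
    hits = sum(1 for entity in gold_entities if entity.lower() in text)
    coverage = hits / len(gold_entities) if gold_entities else 0.0
    return 1.0 + rubric_weight * coverage

chain = ["Pierre Curie", "Paris"]
full = rubric_reward("Her spouse was Pierre Curie, born in Paris.", "Paris", "Paris", chain)
partial = rubric_reward("The answer is Paris.", "Paris", "Paris", chain)
wrong = rubric_reward("Pierre Curie ... Paris ... so Warsaw.", "Warsaw", "Paris", chain)
```

Gating the bonus on correctness means a response cannot raise its reward by listing entities without reaching the right answer, which is the reward-hacking concern the abstract raises.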

Original Article

Cached at: 06/01/26, 03:18 AM

Source: https://huggingface.co/papers/2605.31584

Abstract

LongTraceRL addresses long-context reasoning challenges in large language models through tiered distractor construction and rubric reward design for improved reasoning quality.

Long-context reasoning remains a central challenge for large language models, which often fail to locate and integrate key information in extensive distracting content. Reinforcement learning with verifiable rewards (RLVR) has shown promise for this task, yet existing methods are limited by low-confusability distractors and sparse, outcome-only reward signals that cannot supervise intermediate reasoning steps. To address these issues, we introduce LongTraceRL. For data construction, we generate multi-hop questions via knowledge graph random walks and leverage search agent trajectories to build tiered distractors: documents the agent read but did not cite (high confusability) and documents that appeared in search results but were never opened (low confusability), producing training contexts that are far more challenging than those built by random sampling or one-shot search. For reward design, we propose a rubric reward that uses the gold entities along each reasoning chain as fine-grained, entity-level process supervision. This rubric reward is applied only to responses with correct final answers (positive-only strategy), distinguishing the reasoning quality among correct responses and preventing reward hacking. Experiments on three reasoning LLMs (4B–30B) across five long-context benchmarks demonstrate that LongTraceRL consistently outperforms strong baselines and encourages comprehensive, evidence-grounded reasoning. Code, datasets, and models are available at https://github.com/THU-KEG/LongTraceRL.
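The tiered distractor split described in the abstract can be illustrated with a short sketch. The `Trajectory` record and its field names are hypothetical stand-ins for whatever a search agent log actually contains; only the two tier definitions (read but not cited, returned but never opened) come from the abstract.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    """Hypothetical search agent log: document ids at each stage of use."""
    searched: list  # ids that appeared in search results
    opened: list    # ids the agent actually read
    cited: list     # ids the agent cited as evidence

def tier_distractors(traj):
    """Split non-cited documents into high- and low-confusability tiers."""
    cited = set(traj.cited)
    opened = set(traj.opened)
    # High confusability: read by the agent but not cited.
    high = [d for d in dict.fromkeys(traj.opened) if d not in cited]
    # Low confusability: surfaced by search but never opened.
    low = [d for d in dict.fromkeys(traj.searched)
           if d not in opened and d not in cited]
    return high, low

traj = Trajectory(
    searched=["d1", "d2", "d3", "d4", "d5"],
    opened=["d1", "d2", "d3"],
    cited=["d1"],
)
high, low = tier_distractors(traj)
```

A training context would then combine the cited documents with distractors drawn from both tiers; how many of each tier to include is not stated in the abstract.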
