Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short

Hugging Face Daily Papers 06/08/26, 11:57 AM Papers

Summary

Reasoning Arena improves reinforcement learning with verifiable rewards by using trace tournaments and Bradley-Terry models to generate meaningful gradients from non-diverse reward groups, resulting in faster training and better reasoning performance.

Reinforcement learning with verifiable rewards (RLVR) has become a leading paradigm for improving the reasoning ability of large language models through outcome-based supervision. However, verifiable rewards frequently become uninformative at the group level: when all sampled traces of a given prompt receive identical rewards, group-relative advantage estimation provides no gradient signal, even though the traces may differ substantially in reasoning quality. We propose Reasoning Arena, an adaptive training framework that routes such non-diverse reward groups to a judge system instead of discarding them. Beyond examining the final answer, Reasoning Arena constructs trace tournaments, where reasoning traces are compared head-to-head to expose finer-grained preferences within the group, converting reasoning quality into rich relative reward signals. To make reward estimation efficient, rather than exhaustively comparing every pair, each new trace is evaluated against a small, dynamically updated pool of previously generated traces as anchors to efficiently establish a relative ranking. We then fit a Bradley-Terry model on the incomplete comparison graph, enabling scalable RL integration without quadratic pairwise comparisons. Empirical results demonstrate that Reasoning Arena consistently outperforms the RLVR baseline by 7.6% on average in competition mathematics and coding benchmarks. By converting otherwise wasted zero-advantage samples into useful gradient updates, our method accelerates training by 27% to 41%, saving nearly 50% of generation compute, and substantially improves overall reasoning performance.
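The zero-gradient case described above can be illustrated with a minimal sketch. It assumes a common GRPO-style normalization (reward minus group mean, divided by group standard deviation); the abstract does not state the exact formula, and the names `group_relative_advantages` and `needs_judge` are hypothetical, introduced only for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Assumed GRPO-style advantage: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def needs_judge(rewards):
    """True when every trace in the group received the same verifiable reward.

    Hypothetical routing rule: such groups carry no group-relative signal,
    so they would be sent to trace comparison instead of being discarded.
    """
    return len(set(rewards)) <= 1


mixed = [1.0, 0.0, 1.0, 0.0]    # some traces correct, some not
uniform = [1.0, 1.0, 1.0, 1.0]  # all traces correct: identical rewards

mixed_adv = group_relative_advantages(mixed)
uniform_adv = group_relative_advantages(uniform)

print(mixed_adv)             # nonzero advantages, usable gradient signal
print(uniform_adv)           # all zeros, no gradient signal
print(needs_judge(uniform))  # True: candidate for a trace tournament
```

Under this assumed normalization, a group with identical rewards yields an all-zero advantage vector regardless of how the traces differ in reasoning quality, which is the situation the routing step targets.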

Original Article


Cached at: 06/09/26, 08:41 AM


Source: https://huggingface.co/papers/2606.09380

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a leading paradigm for improving the reasoning ability of large language models through outcome-based supervision. However, verifiable rewards frequently become uninformative at the group level: when all sampled traces of a given prompt receive identical rewards, group-relative advantage estimation provides no gradient signal, even though the traces may differ substantially in reasoning quality. We propose Reasoning Arena, an adaptive training framework that routes such non-diverse reward groups to a judge system instead of discarding them. Beyond examining the final answer, Reasoning Arena constructs trace tournaments, where reasoning traces are compared head-to-head to expose finer-grained preferences within the group, converting reasoning quality into rich relative reward signals. To make reward estimation efficient, rather than exhaustively comparing every pair, each new trace is evaluated against a small, dynamically updated pool of previously generated traces as anchors to efficiently establish a relative ranking. We then fit a Bradley-Terry model on the incomplete comparison graph, enabling scalable RL integration without quadratic pairwise comparisons. Empirical results demonstrate that Reasoning Arena consistently outperforms the RLVR baseline by 7.6% on average in competition mathematics and coding benchmarks. By converting otherwise wasted zero-advantage samples into useful gradient updates, our method accelerates training by 27% to 41%, saving nearly 50% of generation compute, and substantially improves overall reasoning performance.
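The Bradley-Terry step on an incomplete comparison graph can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: it uses the standard minorization-maximization iteration for Bradley-Terry strengths, adds a small pseudo-count against a virtual opponent of fixed strength 1 so that traces with no wins keep a finite score, and hard-codes the judge outcomes. The function name `fit_bradley_terry` and the `prior` parameter are hypothetical.

```python
import math
from collections import defaultdict


def fit_bradley_terry(n_items, comparisons, iters=500, prior=0.1):
    """Fit Bradley-Terry strengths from (winner, loser) index pairs.

    Only the pairs that were actually compared are needed, so the
    comparison graph may be incomplete. Returns log-strengths centered
    to mean zero. `prior` is an assumed regularizer: each item gets
    `prior` virtual wins and losses against an opponent of strength 1.
    """
    wins = [0.0] * n_items
    games = defaultdict(float)  # (i, j) with i < j -> number of comparisons
    for winner, loser in comparisons:
        wins[winner] += 1.0
        games[(min(winner, loser), max(winner, loser))] += 1.0

    p = [1.0] * n_items
    for _ in range(iters):
        # Contribution of the virtual opponent (2 * prior virtual games).
        denom = [2.0 * prior / (p[i] + 1.0) for i in range(n_items)]
        for (i, j), n in games.items():
            d = n / (p[i] + p[j])
            denom[i] += d
            denom[j] += d
        p = [(wins[i] + prior) / denom[i] for i in range(n_items)]

    logs = [math.log(x) for x in p]
    center = sum(logs) / n_items
    return [s - center for s in logs]


# Traces 0, 1, 2 act as anchors; trace 3 is new and is compared only
# against anchors 0 and 1 (never against 2). Outcomes are made up.
comparisons = [
    (0, 1), (1, 2), (0, 2),  # comparisons among anchors
    (3, 1), (0, 3),          # new trace beats anchor 1, loses to anchor 0
]
scores = fit_bradley_terry(4, comparisons)
ranking = sorted(range(4), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Here five comparisons place four traces on a common scale even though the pair (2, 3) was never judged. How the fitted scores are turned into rewards or advantages is not specified in the abstract and is left out of this sketch.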



Similar Articles

TRON: Targeted Rule-Verifiable Online Environments for Visual Reasoning RL

Video Models Can Reason with Verifiable Rewards

Stratagem: Learning Transferable Reasoning via Trajectory-Modulated Game Self-Play

GRAIL: Gradient-Reweighted Advantages for Reinforcement Learning with Verifiable Rewards

Prioritizing the Best: Incentivizing Reliable Multimodal Reasoning by Rewarding Beyond Answer Correctness
