TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning

Hugging Face Daily Papers 06/09/26, 05:16 PM Papers

Summary

TRACE is a unified rollout budget allocation framework that enhances reward contrast in multi-turn agentic reinforcement learning by dynamically distributing resources across tree-structured rollouts based on prefix-level informativeness. It improves efficiency and accuracy on agentic benchmarks like Multi-Hop QA.

Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. However, rollout-intensive policy optimization is often limited by insufficient reward contrast, arising when overly simple or complex prompts generate low-variance feedback and when outcome-only rewards assign the same terminal assessment to every decision in a multi-turn rollout. Past efforts have focused on allocating available rollout resources to promising prompts, yet they only leverage sample informativeness at the prompt level and neglect variation in prefix-level informativeness across turns within the same rollout. This work targets multi-turn agentic RL by modeling each ReAct-style thought-action-observation turn as a semantically distinct node, allowing budget allocation to extend from prompt roots to turn-level prefixes with further continuations, which naturally forms tree-structured rollouts. We introduce Tree Rollout Allocation for Contrastive Exploration (TRACE), a unified rollout allocation framework that enhances reward contrast within a fixed sampling budget. Technically, TRACE allocates rollout budget to both prompt roots and intermediate prefixes that are most likely to yield mixed terminal rewards. A shared generalizable predictor estimates conditional success probability at these anchors from prefix histories to guide this allocation. The resulting adaptive tree structure enriches outcome-only feedback and amplifies the policy-update signal. Empirically, TRACE achieves competitive performance and efficiency gains on typical agentic benchmarks, e.g., improving Qwen3-14B Multi-Hop QA average accuracy by 2.8 points over competitive baselines at equal sampling cost.
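The allocation idea in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the abstract does not give the scoring rule or the rounding scheme, so the sketch assumes that "most likely to yield mixed terminal rewards" can be approximated by the Bernoulli variance p(1 - p) of a predicted success probability p, and that a fixed budget is split in proportion to that score. The names Anchor, mixed_reward_score, and allocate_budget, along with the example probabilities, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Anchor:
    """A point where new rollouts can start: a prompt root or a turn-level prefix."""
    name: str
    p_success: float  # assumed output of a success predictor (hypothetical values below)


def mixed_reward_score(p: float) -> float:
    """Bernoulli variance p * (1 - p): largest at p = 0.5, zero at p = 0 or p = 1.

    Assumption: used here as a stand-in for how likely an anchor is to
    produce both successful and failed continuations.
    """
    return p * (1.0 - p)


def allocate_budget(anchors: list[Anchor], budget: int) -> dict[str, int]:
    """Split a fixed rollout budget across anchors in proportion to their scores.

    Largest-remainder rounding keeps the total exactly equal to `budget`.
    """
    scores = [mixed_reward_score(a.p_success) for a in anchors]
    total = sum(scores)
    if total == 0.0:
        # No anchor is expected to give mixed rewards: fall back to an even split.
        scores = [1.0] * len(anchors)
        total = float(len(anchors))
    exact = [budget * s / total for s in scores]
    counts = [int(x) for x in exact]  # floor, since all values are non-negative
    leftover = budget - sum(counts)
    # Give the remaining rollouts to the anchors with the largest fractional parts.
    order = sorted(range(len(anchors)), key=lambda i: exact[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return {a.name: c for a, c in zip(anchors, counts)}


# Hypothetical anchors: three prompt roots and one intermediate prefix.
anchors = [
    Anchor("easy_prompt_root", 0.98),
    Anchor("hard_prompt_root", 0.02),
    Anchor("medium_prompt_root", 0.50),
    Anchor("medium_prompt/turn_2_prefix", 0.40),
]
allocation = allocate_budget(anchors, budget=16)
print(allocation)
```

Under these assumptions, anchors whose predicted success probability is near 0 or 1 receive few rollouts, while roots and prefixes near 0.5 receive most of the budget. Treating prompt roots and turn-level prefixes as the same kind of anchor is what lets a single rule cover both levels of the tree.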

Original Article

Cached at: 06/11/26, 01:37 PM

Source: https://huggingface.co/papers/2606.11119

Similar Articles

StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization

Cross-Epoch Adaptive Rollout Optimization for RL Post-Training

Read the Trace, Steer the Path: Trajectory-Aware Reinforcement Learning for Diffusion Language Models

TRACE: Trajectory Reasoning through Adaptive Cross-Step Evidence Aggregation for LLM Agents
