TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning
Summary
TRACE is a unified rollout budget allocation framework that enhances reward contrast in multi-turn agentic reinforcement learning by dynamically distributing resources across tree-structured rollouts based on prefix-level informativeness. It improves efficiency and accuracy on agentic benchmarks like Multi-Hop QA.
Cached at: 06/11/26, 01:37 PM
Source: https://huggingface.co/papers/2606.11119
Abstract
TRACE is a rollout allocation framework that improves reward contrast in multi-turn agentic reinforcement learning by dynamically distributing resources across tree-structured rollouts based on prefix-level informativeness.
Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. However, rollout-intensive policy optimization is often limited by insufficient reward contrast, arising when overly simple or complex prompts generate low-variance feedback and when outcome-only rewards assign the same terminal assessment to every decision in a multi-turn rollout. Past efforts have focused on allocating available rollout resources to promising prompts, yet they only leverage sample informativeness at the prompt level and neglect variation in prefix-level informativeness across turns within the same rollout. This work targets multi-turn agentic RL by modeling each ReAct-style thought-action-observation turn as a semantically distinct node, allowing budget allocation to extend from prompt roots to turn-level prefixes with further continuations, which naturally forms tree-structured rollouts. We introduce Tree Rollout Allocation for Contrastive Exploration (TRACE), a unified rollout allocation framework that enhances reward contrast within a fixed sampling budget. Technically, TRACE allocates rollout budget to both prompt roots and intermediate prefixes that are most likely to yield mixed terminal rewards. A shared generalizable predictor estimates conditional success probability at these anchors from prefix histories to guide this allocation. The resulting adaptive tree structure enriches outcome-only feedback and amplifies the policy-update signal. Empirically, TRACE achieves competitive performance and efficiency gains on typical agentic benchmarks, e.g., improving Qwen3-14B Multi-Hop QA average accuracy by 2.8 points over competitive baselines at equal sampling cost.
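The allocation idea in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: the abstract does not give the scoring rule, so the code below assumes binary terminal rewards, independent rollouts, and a score of p(1 - p) (the Bernoulli variance) as a stand-in for "most likely to yield mixed terminal rewards". The names `Anchor`, `mixed_reward_prob`, and `allocate_budget` are hypothetical, and the success probabilities are supplied by hand where TRACE would use its learned predictor.

```python
from dataclasses import dataclass


@dataclass
class Anchor:
    """A prompt root or a turn-level prefix from which rollouts can continue."""
    name: str
    p_success: float  # predicted conditional success probability (assumed input)


def mixed_reward_prob(p: float, n: int) -> float:
    """Probability that n independent binary-reward rollouts are not all equal."""
    return 1.0 - p ** n - (1.0 - p) ** n


def allocate_budget(anchors: list[Anchor], budget: int) -> dict[str, int]:
    """Split `budget` rollouts across anchors in proportion to p * (1 - p).

    The p * (1 - p) score is an assumption made for this sketch, not the
    rule used by TRACE.
    """
    scores = [a.p_success * (1.0 - a.p_success) for a in anchors]
    total = sum(scores)
    if total == 0.0:
        # No anchor is expected to produce contrast; fall back to an even split.
        scores = [1.0] * len(anchors)
        total = float(len(anchors))
    raw = [budget * s / total for s in scores]
    alloc = [int(r) for r in raw]
    # Hand out the rollouts lost to rounding, largest fractional part first.
    leftovers = sorted(
        range(len(anchors)), key=lambda i: raw[i] - alloc[i], reverse=True
    )
    for i in leftovers[: budget - sum(alloc)]:
        alloc[i] += 1
    return {a.name: n for a, n in zip(anchors, alloc)}


# Illustrative anchors: two prompt roots with near-certain outcomes and one
# intermediate prefix whose outcome is still open.
anchors = [
    Anchor("easy prompt", 0.98),
    Anchor("hard prompt", 0.02),
    Anchor("turn-2 prefix", 0.50),
]
allocation = allocate_budget(anchors, budget=16)
```

Under these assumptions, most of the budget goes to the anchor whose predicted success probability is closest to 0.5, since that is where a group of rollouts is most likely to contain both successes and failures. Anchors with near-certain outcomes receive little, which reflects the low-variance feedback problem the abstract describes for overly simple or overly complex prompts.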
Similar Articles
StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction
StraTA proposes strategic trajectory abstraction for long-horizon LLM agents, using hierarchical GRPO-style rollout with diverse strategy sampling and critical self-judgment to improve sample efficiency and final performance over frontier models and prior RL baselines.
DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization
This paper proposes DRIFT, a framework that combines offline trajectories with importance-weighted supervised fine-tuning to efficiently achieve multi-turn interactive learning performance comparable to reinforcement learning.
Cross-Epoch Adaptive Rollout Optimization for RL Post-Training
This paper presents CERO, a cross-epoch adaptive rollout optimization method for RL post-training of LLMs, which allocates a fixed rollout budget across prompts and epochs using Bayesian posterior variance to maximize sample efficiency, achieving theoretical regret bounds and outperforming GRPO on mathematical reasoning tasks.
Read the Trace, Steer the Path: Trajectory-Aware Reinforcement Learning for Diffusion Language Models
This paper introduces CAPR (Cached-Amortized Path Refinement), a reinforcement learning algorithm for diffusion large language models that extracts tree-like supervision signals from the denoising trace without the compute cost of full tree rollouts. CAPR achieves state-of-the-art performance on reasoning benchmarks like GSM8K, Math500, Sudoku, and Countdown at roughly 0.75x the cost of flat rollouts.
TRACE: Trajectory Reasoning through Adaptive Cross-Step Evidence Aggregation for LLM Agents
TRACE is a monitoring framework for long-horizon LLM agent trajectories that uses a Triage-Inspect-Judge loop to connect evidence across temporally distant actions, achieving high recall and F1 on evasive sabotage detection tasks.