ExTra: Exploratory Trajectory Optimization for Language Model Reinforcement Learning

arXiv cs.LG Papers

Summary

ExTra introduces exploratory trajectory optimization for language model reinforcement learning, combining novelty rewards and entropy-guided prefix regeneration to improve both single-sample accuracy and inference-time coverage on mathematical reasoning benchmarks.

arXiv:2606.24994v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) for language-model reasoning can fail at both extremes of task difficulty: easy prompts often produce all-correct, low-diversity rollout groups with little gradient signal, while hard prompts can produce all-incorrect groups with no positive reward. We introduce ExTra (Exploratory Trajectory Optimization), a GRPO-compatible framework that extracts exploration signals from the model's own rollouts. ExTra combines two mechanisms: (i) a novelty reward that adds embedding-based diversity bonuses after GRPO normalization, rewarding diverse correct solutions; and (ii) entropy-guided prefix regeneration, which scores partial trajectories using entropy signals and continues exploration from promising intermediate steps. Across six mathematical reasoning benchmarks, ExTra improves Qwen3-1.7B over GRPO by about +5 points on pass@1 and +7 points on pass@16, showing that trajectory-level exploration signals can improve both single-sample accuracy and inference-time coverage.
Original Article
View Cached Full Text

Cached at: 06/25/26, 05:10 AM

# ExTra: Exploratory Trajectory Optimization for Language Model Reinforcement Learning
Source: [https://arxiv.org/html/2606.24994](https://arxiv.org/html/2606.24994)
Wenyang Hu1,2, Junxiang Jia2, Zhen Shu2, Daniel Dahlmeier2, See\-Kiong Ng1, Bryan Kian Hsiang Low1 1National University of Singapore,2SAP

###### Abstract

Reinforcement Learning with Verifiable Rewards \(RLVR\) for language\-model reasoning can fail at both extremes of task difficulty: easy prompts often produce all\-correct, low\-diversity rollout groups with little gradient signal, while hard prompts can produce all\-incorrect groups with no positive reward\. We introduceExTra\(Exploratory Trajectory Optimization\), a GRPO\-compatible framework that extracts exploration signals from the model’s own rollouts\. ExTra combines two mechanisms: \(i\) a novelty reward that adds embedding\-based diversity bonuses after GRPO normalization, rewarding diverse*correct*solutions; and \(ii\) entropy\-guided prefix regeneration, which scores partial trajectories using entropy signals and continues exploration from promising intermediate steps\. Across six mathematical reasoning benchmarks, ExTra improves Qwen3\-1\.7B over GRPO by about\+5points on pass@1 and\+7points on pass@16, showing that trajectory\-level exploration signals can improve both single\-sample accuracy and inference\-time coverage\. Code is available at:[https://github\.com/allen4747/extra](https://github.com/allen4747/extra)\.

ExTra: Exploratory Trajectory Optimization for Language Model Reinforcement Learning

Wenyang Hu††thanks:Correpondence to: wenyang\.hu@u\.nus\.edu1,2, Junxiang Jia2, Zhen Shu2, Daniel Dahlmeier2,See\-Kiong Ng1, Bryan Kian Hsiang Low11National University of Singapore,2SAP

## 1Introduction

Reinforcement learning has become a central post\-training paradigm for improving LLM reasoning capabilitiesOuyanget al\.\([2022](https://arxiv.org/html/2606.24994#bib.bib10)\); DeepSeek\-AI \([2025](https://arxiv.org/html/2606.24994#bib.bib16)\)\. Group Relative Policy Optimization \(GRPO\)Shaoet al\.\([2024](https://arxiv.org/html/2606.24994#bib.bib8)\)is a widely adopted variant that removes the learned value network by normalizing rewards within a sampled group of trajectories\. This design is effective, but it also exposes a structural weakness at the extremes of task difficulty, where group\-level rewards often become homogeneous\.

#### Two failure modes of GRPO\.

When a problem is*easy*, the model generates nearly\-uniform correct responses; the within\-group reward variance collapses to near zero, producing negligible gradients\. The policy then over\-optimizes for a narrow set of reasoning patterns, suppressing trajectory diversity\. This diversity collapse directly degrades pass@kkfork\>1k\>1, a metric that governs inference\-time scaling via majority voting or best\-of\-NNsamplingBrownet al\.\([2024](https://arxiv.org/html/2606.24994#bib.bib7)\)\. When a problem is*hard*, the group may contain no correct responses; all advantages vanish and the policy receives no gradient at all\. Together, these failure modes ensure that a substantial fraction of every training batch is either harmful or wasted, directly limiting sample efficiency\.

The recent effortsYuet al\.\([2025](https://arxiv.org/html/2606.24994#bib.bib9)\); Shrivastavaet al\.\([2026](https://arxiv.org/html/2606.24994#bib.bib6)\)\(e\.g\., DAPO\) propose mitigations such as difficulty\-based data filtering and over\-sampling, which \(1\) discard problems with empirical pass rates outside a target window and \(2\) keep sampling until the group accuracy is neither 0 nor 1\. However, this comes at a steep cost: hard problems, which carry the densest learning signal for frontier reasoning, are discarded entirely, and easy problems, which are retained, still erode diversity over time\. More fundamentally, resampling from scratch resets exploration to the beginning of the reasoning chain, ignoring trajectories that the model has already produced, which may already encode useful intermediate structure that informs the solution space\.

#### Exploiting latent signals in trajectories\.

This observation points toward a different perspective\. Rather than discarding existing trajectories or resampling blindly, we ask:*is there signal already embedded in the model’s own generation process that can guide exploration more efficiently?*During every rollout, the model produces two kinds of signal that previous approaches ignored: the*diversity*of previous trajectories, which could distinguish reasoning strategies that are genuinely novel from those that merely paraphrase; and the*structure*of failed trajectories, which encodes how well the intermediate states progressed\. We exploit both signals through two complementary mechanisms\.

With an embedding\-based*novelty reward*that encourages correct solutions that are semantically novel relative to those already seen for the same prompt\. We design novelty to be gated on correctness: only correct\-and\-novel solutions are rewarded, so the policy is never incentivized to produce diverse failures at the expense of task performance\. This restores the gradient signal on easy problems while preserving the stability of task reward optimization\.

For hard problems, rather than resampling from scratch and exploring the full trajectory space, we novelly propose to regenerate from a promising intermediate reasoning step, effectively continuing exploration from a step already deep in the search space\. We identify the most promising reasoning step and use it as a prefix for future generations\. Through an embedded signal in the LLM’s own generation process, Mean Token Entropy \(MTE\), the prefix selection requires no supervision and enables guided sampling\. It is a validated empirical choice by our experiments in Sec\.[5\.6](https://arxiv.org/html/2606.24994#S5.SS6)\.

Both mechanisms treat generated trajectories as a source of exploratory signal: novelty rewards optimize for*diversity*across trajectories on easy problems, while entropy\-guided regeneration optimizes for*structure*within trajectories on hard problems\. Together, they formExTra\(ExploratoryTrajectory Optimization\), our novel algorithm that improves LLM reinforcement learning by explicitly optimizing trajectories to be both diverse and structurally promising, operating entirely within the standard GRPO loop without sacrificing sample efficiency, additional reward models, or supervision beyond the binary correctness signal\.

We provide theoretical analysis to further support how regeneration helps efficient exploration in RLVR\. Our Sec\.[4\.4](https://arxiv.org/html/2606.24994#S4.SS4)elaborates that an informative proxy helps reduce the exploration space and guides exploration\. Our empirical analysis validates that there exists such an effective proxy, MTE, which requires no supervision and heavy computation\.

Extensive evaluation on six mathematical reasoning benchmarks \(MATH\-500, AMC23, Minerva, OlympiadBench, AIME24, and AIME25\) shows that ExTra improves Qwen3\-1\.7B over GRPO by about\+5on pass@1 and\+7on pass@16\. Our ablations and analyses show that the gains come from both improved solution quality and broader coverage of reasoning trajectories\.

![Refer to caption](https://arxiv.org/html/2606.24994v1/figures/overview.png)Figure 1:Overview of ExTra\.Top:On easy problems, a novelty reward is added to GRPO’s normalized advantage, steering the policy toward diverse correct reasoning strategies\.Bottom:On hard problems, all rollout prefixes are scored by Mean Token Entropy; The lowest\-entropy prefix is re\-queued as a guided prompt for the next batch, encouraging exploration from a promising intermediate reasoning step\.

## 2Related Work

#### RL for LLM Reasoning\.

PPOSchulmanet al\.\([2017](https://arxiv.org/html/2606.24994#bib.bib17)\)was first applied to align LLMs via RLHFOuyanget al\.\([2022](https://arxiv.org/html/2606.24994#bib.bib10)\)and later adapted for chain\-of\-thought mathematical reasoningWeiet al\.\([2022](https://arxiv.org/html/2606.24994#bib.bib18)\)\. GRPOShaoet al\.\([2024](https://arxiv.org/html/2606.24994#bib.bib8)\)replaces the value network with within\-group reward normalization; DeepSeek\-R1DeepSeek\-AI \([2025](https://arxiv.org/html/2606.24994#bib.bib16)\)showed that pure GRPO\-style RL can elicit strong reasoning from scratch\. DAPOYuet al\.\([2025](https://arxiv.org/html/2606.24994#bib.bib9)\)improves data efficiency by filtering on empirical pass rates, but discards hard problems entirely\. ExTra retains the full data distribution and instead modifies the advantage computation and rollout strategy to extract signal from both difficulty extremes\.

#### Intrinsic Motivation and Exploration Bonuses\.

Intrinsic rewards are a classical tool for encouraging exploration in sparse\-reward environmentsPathaket al\.\([2017](https://arxiv.org/html/2606.24994#bib.bib11)\); Burdaet al\.\([2019](https://arxiv.org/html/2606.24994#bib.bib12)\)\. ICMPathaket al\.\([2017](https://arxiv.org/html/2606.24994#bib.bib11)\)rewards prediction error in a learned feature space; RNDBurdaet al\.\([2019](https://arxiv.org/html/2606.24994#bib.bib12)\)uses the error of a fixed random network as a novelty proxy\. Adapting these ideas to LLM sequence generation requires a novelty measure that operates in the high\-dimensional space of natural language reasoning paths\.\(Daiet al\.,[2025](https://arxiv.org/html/2606.24994#bib.bib2)\)proposes using a novelty reward derived from multi\-head critics trained on the policy model\. In contrast, ExTra uses cosine distance between sentence embeddings, which requires no additional training\.

#### Process Supervision and Intermediate Rewards\.

Process reward models \(PRMs\) provide step\-level supervision for reasoningLightmanet al\.\([2023](https://arxiv.org/html/2606.24994#bib.bib13)\); Math\-ShepherdWanget al\.\([2024](https://arxiv.org/html/2606.24994#bib.bib15)\)automates this via Monte Carlo rollouts from each step\. PRMs are powerful but require either human annotation or a separately\-trained reward model, adding significant cost\. ExTra’s entropy heuristic serves the same role—identifying high\-quality intermediate states—while being computed at zero marginal cost from generation logits\.

#### Search and Guided Reasoning\.

Tree\-of\-ThoughtYaoet al\.\([2023](https://arxiv.org/html/2606.24994#bib.bib14)\)applies a systematic tree search over reasoning steps at*inference time*\. ExTra differs in two respects: it operates at*training time*, enriching the replay buffer with trajectories continuing from promising prefixes; and it uses a single lightweight selection step rather than a full tree expansion, making it compatible with standard RL pipelines\.

## 3Background

Let𝒙∈𝒳\{\\bm\{x\}\}\\in\\mathcal\{X\}denote a prompt and𝒚∈𝒴\{\\bm\{y\}\}\\in\\mathcal\{Y\}a response sampled from policyπθ\\pi\_\{\\theta\}\. In RLVR, the rewardR​\(𝒙,𝒚\)R\(\{\\bm\{x\}\},\{\\bm\{y\}\}\)is usually binary, indicating whether the final answer can be verified as correct\. The training objective is

maxθ⁡𝔼𝒙∼𝒟,𝒚∼πθ\(⋅\|𝒙\)​\[R​\(𝒙,𝒚\)\]\.\\max\_\{\\theta\}\\;\\mathbb\{E\}\_\{\{\\bm\{x\}\}\\sim\\mathcal\{D\},\\,\{\\bm\{y\}\}\\sim\\pi\_\{\\theta\}\(\\cdot\|\{\\bm\{x\}\}\)\}\\bigl\[R\(\{\\bm\{x\}\},\{\\bm\{y\}\}\)\\bigr\]\.
Policy\-gradient methods optimize this objective through an advantageA​\(𝒙,𝒚\)A\(\{\\bm\{x\}\},\{\\bm\{y\}\}\), which measures how much better a sampled response is than the policy’s expected response for the same prompt\. The policy\-gradient theoremSuttonet al\.\([1999](https://arxiv.org/html/2606.24994#bib.bib5)\)gives

∇θ𝔼​\[R​\(𝒙,𝒚\)\]∝𝔼​\[A​\(𝒙,𝒚\)​∇θlog⁡πθ​\(𝒚\|𝒙\)\]\.\\nabla\_\{\\theta\}\\,\\mathbb\{E\}\\bigl\[R\(\{\\bm\{x\}\},\{\\bm\{y\}\}\)\\bigr\]\\propto\\mathbb\{E\}\\bigl\[A\(\{\\bm\{x\}\},\{\\bm\{y\}\}\)\\,\\nabla\_\{\\theta\}\\log\\pi\_\{\\theta\}\(\{\\bm\{y\}\}\|\{\\bm\{x\}\}\)\\bigr\]\.Thus, positive\-advantage responses are reinforced, negative\-advantage responses are suppressed, and near\-zero advantages produce little update\.

PPOSchulmanet al\.\([2017](https://arxiv.org/html/2606.24994#bib.bib17)\)typically estimates advantages with a learned value network\. GRPOShaoet al\.\([2024](https://arxiv.org/html/2606.24994#bib.bib8)\)removes this network by sampling a group ofGGresponses\{𝒚i\}i=1G\\\{\{\\bm\{y\}\}\_\{i\}\\\}\_\{i=1\}^\{G\}from the old policyπθold\\pi\_\{\\theta\_\{\\mathrm\{old\}\}\}for the same prompt𝒙\{\\bm\{x\}\}and standardizing their rewards:

A^i=ri−μrσr,\\hat\{A\}\_\{i\}=\\frac\{r\_\{i\}\-\\mu\_\{r\}\}\{\\sigma\_\{r\}\},whereri=R​\(𝒙,𝒚i\)r\_\{i\}=R\(\{\\bm\{x\}\},\{\\bm\{y\}\}\_\{i\}\), andμr\\mu\_\{r\}andσr\\sigma\_\{r\}are the group mean and standard deviation\. Whenσr=0\\sigma\_\{r\}=0, we use the standard zero\-advantage convention\. The policy is then updated with a clipped surrogate objective regularized by KL divergence to a reference policyπref\\pi\_\{\\mathrm\{ref\}\}; the full expression appears in Appendix[B](https://arxiv.org/html/2606.24994#A2)\.

## 4ExTra: Exploratory Trajectory Optimization

Fig\.[1](https://arxiv.org/html/2606.24994#S1.F1)illustrates the full ExTra framework\. The two mechanisms are modular\. Novelty rewards are applied at every training step, while prefix regeneration is triggered only for prompts whose within\-group pass rate falls below1/G1/G\(quantified as hard\)\.

### 4\.1Novelty\-Driven Intrinsic Rewards

For each prompt𝒙\{\\bm\{x\}\}, we maintain a rollout memoryΦ​\(𝒙\)\\Phi\(\{\\bm\{x\}\}\)containing sentence embeddings of previously generated responses and sibling responses from the same rollout group, computed by a lightweight embedding modelh​\(⋅\)h\(\\cdot\)\. The intrinsic reward for a new response𝐲\\mathbf\{y\}with embedding𝐳=h​\(𝐲\)\\mathbf\{z\}=h\(\\mathbf\{y\}\)is:

rint​\(𝐲\)=𝟙\[R​\(𝐱,𝐲\)\>0\]⋅\(1−max𝐳′∈Φ​\(𝐱\)⁡cos⁡\(𝐳,𝐳′\)\)\.r\_\{\\text\{int\}\}\(\\mathbf\{y\}\)=\\mathds\{1\}\_\{\[R\(\\mathbf\{x\},\\mathbf\{y\}\)\>0\]\}\\cdot\\Bigl\(1\-\\max\_\{\\mathbf\{z\}^\{\\prime\}\\in\\Phi\(\\mathbf\{x\}\)\}\\cos\(\\mathbf\{z\},\\mathbf\{z\}^\{\\prime\}\)\\Bigr\)\.\(1\)The correctness gate𝟙\[R\>0\]\\mathds\{1\}\_\{\[R\>0\]\}is essential: novelty is rewarded*only*among correct solutions\. An incorrect response receives no bonus, even if it differs from all previous responses, so ExTra does not incentivize diverse failures\.

#### Post\-normalization application\.

Where to apply the novelty reward matters\. Directly adding novelty to the raw reward before GRPO normalization couples exploration to the group mean and variance, so the contribution of the binary task reward is shrunk or distorted\. This is supported by our analysis in Appx[D\.1](https://arxiv.org/html/2606.24994#A4.SS1)\. We therefore propose to add novelty after group normalization:

A^ifinal=A^iGRPO\+γ​rint\(i\),\\hat\{A\}\_\{i\}^\{\\mathrm\{final\}\}=\\hat\{A\}\_\{i\}^\{\\mathrm\{GRPO\}\}\+\\gamma\\,r\_\{\\mathrm\{int\}\}^\{\(i\)\},\(2\)whereA^iGRPO=\(ri−μr\)/σr\\hat\{A\}\_\{i\}^\{\\mathrm\{GRPO\}\}=\(r\_\{i\}\-\\mu\_\{r\}\)/\\sigma\_\{r\}whenσr\>0\\sigma\_\{r\}\>0and0otherwise\. The coefficientγ\\gammahas a direct interpretation: it is the additional normalized advantage assigned to a maximally novel correct response relative to a minimally novel one\.

Our theoretical analysis in App\.[E\.3](https://arxiv.org/html/2606.24994#A5.SS3)shows that each unit of novelty reward corresponds to discovering a previously uncovered correct mode inΦ​\(𝒙\)\\Phi\(\{\\bm\{x\}\}\), justifying treatingrintr\_\{\\text\{int\}\}as a coverage signal rather than a generic diversity penalty\.

### 4\.2Mean Token Entropy as an Intermediate\-State Heuristic

When a rollout group contains no correct response, neither GRPO nor the correctness\-gated novelty reward provides a positive signal\. To recover useful information from such hard prompts, we need an unsupervised way to identify partial reasoning trajectories that are more likely to lead to a correct completion\.

We define theMean Token Entropy \(MTE\)of a partial trajectory𝒚≤t\{\\bm\{y\}\}\_\{\\leq t\}as

ℋmean\(𝒚≤t\)=1t∑s=1tH\(πθ\(⋅\|𝒙,𝒚<s\)\),\\mathcal\{H\}\_\{\\mathrm\{mean\}\}\(\{\\bm\{y\}\}\_\{\\leq t\}\)=\\frac\{1\}\{t\}\\sum\_\{s=1\}^\{t\}H\\\!\\bigl\(\\pi\_\{\\theta\}\(\\cdot\|\{\\bm\{x\}\},\{\\bm\{y\}\}\_\{<s\}\)\\bigr\),\(3\)whereH​\(π\)=−∑v∈𝒱π​\(v\)​log⁡π​\(v\)H\(\\pi\)=\-\\sum\_\{v\\in\\mathcal\{V\}\}\\pi\(v\)\\log\\pi\(v\)is the per\-step vocabulary entropy\. MTE measures the model’s average internal uncertainty while producing a prefix\. Our hypothesis is that a coherent reasoning chain, even if incomplete, tends to be generated with lower uncertainty than a drifting or inconsistent chain\. MTE is available at no additional forward\-pass cost because it is computed from rollout logits\.

Before using MTE operationally, we validate it in a controlled Monte Carlo study\. Section[5\.6](https://arxiv.org/html/2606.24994#S5.SS6)shows that MTE \(ρwithin=−0\.229\\rho\_\{\\mathrm\{within\}\}=\-0\.229,p=0\.002p=0\.002\) is the strongest within\-problem predictor of prefix value across 600 prefixes from hard problems, outperforming 20 alternative metrics, including several that use the ground\-truth answer\. This confirms our design choice\.

### 4\.3Entropy\-Guided Prefix Regeneration

For each hard prompt, we extract discrete intermediate reasoning steps from allGGrollouts and compute MTE for every prefix\. Individual prefix scores can be noisy, so we smooth them by semantic similarity:

ℋ~j=∑kwj​k​ℋk,wj​k=exp⁡\(cos⁡\(ej,ek\)/τ\)∑lexp⁡\(cos⁡\(ej,el\)/τ\),\\tilde\{\\mathcal\{H\}\}\_\{j\}=\\sum\_\{k\}w\_\{jk\}\\,\\mathcal\{H\}\_\{k\},\\;w\_\{jk\}=\\frac\{\\exp\\\!\\bigl\(\\cos\(e\_\{j\},e\_\{k\}\)/\\tau\\bigr\)\}\{\\sum\_\{l\}\\exp\\\!\\bigl\(\\cos\(e\_\{j\},e\_\{l\}\)/\\tau\\bigr\)\},whereℋj≡ℋmean​\(𝒑j\)\\mathcal\{H\}\_\{j\}\\equiv\\mathcal\{H\}\_\{\\mathrm\{mean\}\}\(\{\\bm\{p\}\}\_\{j\}\)is the raw MTE of prefix𝒑j\{\\bm\{p\}\}\_\{j\},ej=h​\(𝒑j\)e\_\{j\}=h\(\{\\bm\{p\}\}\_\{j\}\)is its embedding, andτ\\tauis a smoothing parameter\. The smoothed score lets semantically similar prefixes share information\. We select𝒑∗=arg⁡minj⁡ℋ~j\{\\bm\{p\}\}^\{\*\}=\\arg\\min\_\{j\}\\tilde\{\\mathcal\{H\}\}\_\{j\}, enqueue the guided prompt\[𝒙,𝒑∗\]\[\{\\bm\{x\}\},\{\\bm\{p\}\}^\{\*\}\], and inject it into a later training batch\. New suffixes are then sampled as𝒚\>t∼πθ\(⋅\|𝒙,𝒑∗\)\{\\bm\{y\}\}\_\{\>t\}\\sim\\pi\_\{\\theta\}\(\\cdot\|\{\\bm\{x\}\},\{\\bm\{p\}\}^\{\*\}\)\. The fixed prefix is treated as context: verification is applied to the concatenated trajectory\(𝒑∗,𝒚\>t\)\(\{\\bm\{p\}\}^\{\*\},\{\\bm\{y\}\}\_\{\>t\}\), while the policy\-gradient loss is computed only on newly generated suffix tokens\. This focuses exploration on an intermediate state that the model already reached with relatively low uncertainty, increasing the chance of obtaining a rewarded continuation\. The full procedure is given in Algorithm[1](https://arxiv.org/html/2606.24994#alg1)\(Appendix[A](https://arxiv.org/html/2606.24994#A1)\)\.

#### Interaction with novelty\.

Regeneration is most useful when the rollout pool contains multiple plausible reasoning paths\. The novelty reward helps maintain this pool; entropy\-guided regeneration then selects among diverse candidates instead of repeatedly amplifying a single low\-entropy pattern\. The ablations in Section[5\.7](https://arxiv.org/html/2606.24994#S5.SS7)test this interaction directly\.

Table 1:Qwen3\-1\.7B mathematical reasoning results\. pass@1 measures single\-sample accuracy; pass@16 measures inference\-time coverage across 16 samples\. Best results are bolded and second\-best results are underlined\.

### 4\.4Why Regeneration Helps: A Search\-Space Analysis

Regeneration admits a simple local justification on hard prompts\. Let𝒴∗​\(𝒙\)⊆𝒱L\\mathcal\{Y\}^\{\*\}\(\{\\bm\{x\}\}\)\\subseteq\\mathcal\{V\}^\{L\}denote the correct trajectories of lengthLLover vocabulary𝒱\\mathcal\{V\}\. Fixing the firsttttokens shrinks the reachable trajectory space from𝒱L\\mathcal\{V\}^\{L\}to𝒱L−t\\mathcal\{V\}^\{L\-t\}; the gain depends on whether the surviving slice contains correct trajectories at higher density than the original\. We compare two sampling strategies for the same policyπθ\\pi\_\{\\theta\}:scratch sampling, which generates a full trajectory𝒚∼πθ\(⋅\|𝒙\)\{\\bm\{y\}\}\\sim\\pi\_\{\\theta\}\(\\cdot\|\{\\bm\{x\}\}\)with success probabilityp0p\_\{0\}; andregeneration from prefixp\{\\bm\{p\}\}, which samples only the suffix𝒚\>t∼πθ\(⋅\|𝒙,𝒑\)\{\\bm\{y\}\}\_\{\>t\}\\sim\\pi\_\{\\theta\}\(\\cdot\|\{\\bm\{x\}\},\{\\bm\{p\}\}\)with conditional success probabilityq​\(𝒑\)q\(\{\\bm\{p\}\}\)\. The two are linked via the law of total probability:p0=𝔼𝒑∼πθ\(⋅\|𝒙\)​\[q​\(𝒑\)\]p\_\{0\}=\\mathbb\{E\}\_\{\{\\bm\{p\}\}\\sim\\pi\_\{\\theta\}\(\\cdot\|\{\\bm\{x\}\}\)\}\[q\(\{\\bm\{p\}\}\)\]\. To show how regeneration helps, we provide a search\-space analysis \(proof appears in App\.[E](https://arxiv.org/html/2606.24994#A5)\):

###### Proposition 1\(Regeneration efficiency\)\.

Withp0p\_\{0\}andq​\(⋅\)q\(\\cdot\)as above,p0=𝔼𝐩∼πθ\(⋅\|𝐱\)​\[q​\(𝐩\)\]p\_\{0\}=\\mathbb\{E\}\_\{\{\\bm\{p\}\}\\sim\\pi\_\{\\theta\}\(\\cdot\|\{\\bm\{x\}\}\)\}\[q\(\{\\bm\{p\}\}\)\], so at least one length\-ttprefix satisfiesq​\(𝐩\)≥p0q\(\{\\bm\{p\}\}\)\\geq p\_\{0\}\. For any selected𝐩∗\{\\bm\{p\}\}^\{\*\}withq​\(𝐩∗\)\>p0q\(\{\\bm\{p\}\}^\{\*\}\)\>p\_\{0\},KKindependent regenerated attempts yieldPregen​\(K\)=1−\(1−q​\(𝐩∗\)\)K\>1−\(1−p0\)K=Pscratch​\(K\)P\_\{\\mathrm\{regen\}\}\(K\)=1\-\(1\-q\(\{\\bm\{p\}\}^\{\*\}\)\)^\{K\}\>1\-\(1\-p\_\{0\}\)^\{K\}=P\_\{\\mathrm\{scratch\}\}\(K\)for everyK≥1K\\geq 1, and the expected attempts until the first correct trajectory drop from1/p01/p\_\{0\}to1/q​\(𝐩∗\)1/q\(\{\\bm\{p\}\}^\{\*\}\)\.

*Implication\.*Scratch sampling marginalizes over all length\-ttprefixes weighted by their generation probability; regeneration replaces this mixture with a single conditional slice\. Becausep0p\_\{0\}is the prefix\-weighted average ofq​\(𝒑\)q\(\{\\bm\{p\}\}\), an above\-average prefix \(q​\(𝒑\)\>p0q\(\{\\bm\{p\}\}\)\>p\_\{0\}\) is guaranteed to exist, and concentrating the budget on it strictly raises the per\-attempt success rate while exploring an exponentially smaller space\. The benefit therefore reduces to one empirical question: can we identify an above\-average prefix without ground\-truth supervision? Sec\.[5\.6](https://arxiv.org/html/2606.24994#S5.SS6)answers this in the affirmative for MTE, which beats many alternatives including supervised ones and identifies above\-average prefixes\. This is exactly the condition required by Prop\.[1](https://arxiv.org/html/2606.24994#Thmproposition1)\.

## 5Experiments

In this section, we evaluate whether ExTra improves RLVR training along three axes:\(i\)final reasoning accuracy, measured bypass@1;\(ii\)inference\-time coverage, measured bypass@16; and\(iii\)exploration efficiency, measured by diversity and rollout cost\.

#### Training setup\.

We train Qwen3\-1\.7BYanget al\.\([2025](https://arxiv.org/html/2606.24994#bib.bib4)\)and Nemontron\-1\.5B on MATH\-DAPOYuet al\.\([2025](https://arxiv.org/html/2606.24994#bib.bib9)\)\. Unless otherwise stated, all methods use group sizeG=6G=6, batch size512512, learning rate3×10−63\\times 10^\{\-6\}, and train for 250 steps\. Full training details, decoding parameters, and implementation details are given in App\.[B](https://arxiv.org/html/2606.24994#A2)\.

#### Evaluation\.

We evaluate on six mathematical reasoning benchmarks: MATH\-500, AMC23, Minerva, OlympiadBench, AIME24, and AIME25\. We report pass@1, the expected accuracy of a single sampled response, and pass@16, the fraction of problems for which at least one of 16 sampled responses is correct\. pass@16 is central to our setting because it captures whether additional sampling exposes correct trajectories that pass@1 may miss; the diversity analysis below tests whether these gains correspond to broader trajectory coverage rather than only higher single\-sample accuracy\.

#### Baselines and variants\.

We compare against standard GRPOShaoet al\.\([2024](https://arxiv.org/html/2606.24994#bib.bib8)\)and DAPOYuet al\.\([2025](https://arxiv.org/html/2606.24994#bib.bib9)\)\. To isolate the two mechanisms in ExTra, we also evaluate two ablations:ExTra\-R, which uses entropy\-guided prefix regeneration only, andExTra\-N, which uses the correctness\-gated novelty reward only\.ExTradenotes the full method\.

### 5\.1Main Results

Tab\.[1](https://arxiv.org/html/2606.24994#S4.T1)reports Qwen3\-1\.7B results for models trained by all five methods across the six benchmarks\. It shows ExTra achieves the highest pass@1 performance, lifting the average pass@1 from 47\.1% to 52\.0% \(\+4\.9\) on six benchmarks\. The improvement is consistent on every benchmark but heavily skewed toward the hardest benchmarks:\+8\.0on AIME24 and\+7\.7on AIME25\.

The pass@16 results show an even clearer exploration benefit\. Because pass@16 measures whether any of 16 sampled solutions is correct, it is sensitive to the breadth of the policy’s correct\-solution support as well as to single\-sample accuracy\. ExTra improves average pass@16 from 66\.4% to 73\.2% \(\+6\.8points\), with particularly large gains on AIME24 \(\+20\.0\) and AIME25 \(\+10\.0\)\. Together with the diversity analysis in Sec\.[5\.4](https://arxiv.org/html/2606.24994#S5.SS4), this suggests that ExTra does not merely sharpen one preferred reasoning path; it broadens the set of correct trajectories reachable at inference time\.

### 5\.2Mechanism Decomposition

The bottom rows of Table[1](https://arxiv.org/html/2606.24994#S4.T1)isolate the contribution of each mechanism\. Relative to GRPO, the regeneration\-only variant \(ExTra\-R\) improves average pass@1 by\+0\.9points but reduces average pass@16 by\-1\.2points\. This pattern is consistent with the role of regeneration: conditioning on selected prefixes can improve solution quality, but by itself it can also narrow the set of explored trajectories\.

ExTra\-N, which uses only the correctness\-gated novelty reward, achieves consistently higher performance on both average pass@1 \(\+4\.0\) and pass@16 \(\+4\.2\)\. This confirms that rewarding correct\-but\-novel trajectories directly counteracts the diversity collapse induced by homogeneous correct groups\.

The full method combines the strengths of both components\. Its average pass@16 gain over GRPO is\+6\.8, exceeding the sum suggested by the individual deltas \(−1\.2\+4\.2=\+3\.0\-1\.2\+4\.2=\+3\.0\)\. This super\-additive effect matches the intended interaction: novelty expands the pool of plausible reasoning paths, and regeneration then selects low\-entropy prefixes from that richer pool\. Without novelty, regeneration can repeatedly amplify similar prefixes; with novelty, it can exploit a broader set of candidate states\.

### 5\.3Transfer Across Base Models

To test whether the gains are specific to Qwen3\-1\.7B, we repeat the main comparison on Nemotron\-1\.5B\. Table[2](https://arxiv.org/html/2606.24994#S5.T2)shows that ExTra again achieves the best average pass@1 and pass@16\. The absolute gains are smaller than on Qwen3\-1\.7B, but the direction is consistent: average pass@1 increases from 57\.9% to 58\.9%, and average pass@16 increases from 77\.2% to 79\.8%\. The strongest gains again appear on harder benchmarks, including\+4\.4pass@1 and\+6\.6pass@16 on AIME24\.

Table 2:Nemontron\-1\.5B mathematical reasoning results\. pass@1 measures single\-sample accuracy; pass@16 measures inference\-time coverage across 16 samples\. Best results are bolded and second\-best results are underlined\.
### 5\.4Diversity and Training Dynamics

The pass@16 gains suggest that ExTra improves inference\-time coverage\. We verify this directly by measuring diversity among trajectories generated for the same problem\. Table[3](https://arxiv.org/html/2606.24994#S5.T3)reports three complementary metrics averaged across the six benchmarks:*Distinct\-4*\(↑\\uparrow\), the fraction of unique 4\-grams across responses;*Self\-BLEU*\(↓\\downarrow\), the mean pairwise BLEU\-4 between responsesZhuet al\.\([2018](https://arxiv.org/html/2606.24994#bib.bib1)\); and*LogDet*\(↑\\uparrow\), the determinantal point process log\-volumelog​det\(I\+α​K\)\\log\\det\(I\+\\alpha K\), whereKKis the cosine\-similarity kernel of MiniLM\-L6 trajectory embeddings\.

ExTra achieves the best Self\-BLEU and LogDet, indicating that its responses are less redundant and occupy a broader semantic region\. DAPO has slightly higher Distinct\-4 \(0\.59 vs\. 0\.58\), but this metric mainly captures surface\-level lexical variation\. The stronger semantic measures favor ExTra, matching the pass@16 results and supporting the claim that ExTra improves trajectory coverage and diversity\.

Table 3:Diversity comparison using different diversity measures averaged across six benchmarks\.Figure[2](https://arxiv.org/html/2606.24994#S5.F2)provides a complementary training\-time view for diversity\. ExTra maintains a non\-trivial novelty signal throughout the training steps while avoiding the entropy collapse observed under plain GRPO\. The final pass@16 improvements therefore do not appear to be a transient early\-training effect; they reflect a sustained change in the policy’s exploration behavior\.

![Refer to caption](https://arxiv.org/html/2606.24994v1/x1.png)Figure 2:Training dynamics of policy entropy and novelty reward\. ExTra sustains an active novelty signal while avoiding the entropy collapse associated with homogeneous rollouts\.
### 5\.5Sample Efficiency

A central goal of ExTra is to improve exploration without the large rollout overhead of dynamic resampling\. We therefore compare methods by the*cumulative number of generated prompt instances*, counted before multiplying by the group sizeGG\. Each prompt instance producesGGresponses, and we count it whether or not it contributes to an update\. GRPO and ExTra remain close to one prompt batch per training step, whereas DAPO resamples groups whose rewards collapse to a single value; in our run, DAPO requires an average of3\.053\.05generation batches per step\.

Fig\.[4](https://arxiv.org/html/2606.24994#S5.F4)plots average pass@16 against this prompt budget at step 250\. ExTra reaches73\.2%73\.2\\%average pass@16 using136136k generated prompt instances\. DAPO uses392392k prompt instances but reaches only63\.6%63\.6\\%, while GRPO reaches66\.4%66\.4\\%with128128k prompt instances\. Thus, ExTra improves over GRPO at a similar rollout cost and outperforms DAPO on both accuracy and sample efficiency\.

![Refer to caption](https://arxiv.org/html/2606.24994v1/x2.png)Figure 3:Sample efficiency comparison at training phase, measured by the cumulative number of prompts sampled\.
![Refer to caption](https://arxiv.org/html/2606.24994v1/x3.png)Figure 4:Selecting low\-entropy prefixes improves pass@kkover random selection, and smoothed entropy further improves it\.

### 5\.6Validating Mean Token Entropy as a Prefix Heuristic

The validity of entropy\-guided regeneration hinges on whether MTE reliably identifies prefixes likely to lead to correct completions\. We test this directly in a controlled Monte Carlo study before using the heuristic inside the training loop\.

#### Setup\.

We extract 600 prefixes from level 4–5 MATH\-500 problems\. For each prefix, we sampleK=16K=16continuations and use the empirical pass rate as an estimate of the prefix’s Monte Carlo value\. We then evaluate 21 candidate prefix metrics, including unsupervised signals such as entropy, perplexity, and layer consistency, as well as ground\-truth\-aware signals such as conditional perplexity to the gold answer and semantic similarity to the gold solution\. We rank metrics by Spearman correlation with the Monte Carlo value\. The primary statistic is within\-problem correlation, because regeneration selects among prefixes belonging to the same prompt\. Full protocol details appear in Appx\.[C\.1](https://arxiv.org/html/2606.24994#A3.SS1)\.

#### Results\.

Table[4](https://arxiv.org/html/2606.24994#S5.T4)reports a representative subset of the metrics\. MTE is the strongest predictor, achievingρwithin=−0\.229\\rho\_\{\\mathrm\{within\}\}=\-0\.229\(p=0\.002p=0\.002\)\. The absolute correlation is moderate, reflecting the inherent noisiness of Monte Carlo prefix evaluation, but its magnitude is more than twice as large as the strongest ground\-truth\-aware metric in the full appendix table\. This is notable because several of those baselines use the gold answer, whereas MTE is label\-free, requires no reward model, and is computed from logits already produced during rollout\.

Table 4:Spearman correlation \(ρ\\rho\) between prefix metrics and Monte Carlo pass rate over 600 prefixes from 22 problems\. Within\-problem correlation controls for problem\-level difficulty and matches the setting in which regeneration selects prefixes\. Full table in App\.[C\.2](https://arxiv.org/html/2606.24994#A3.SS2)\.These findings support the empirical condition required by Proposition[1](https://arxiv.org/html/2606.24994#Thmproposition1): when MTE selects above\-average prefixes, regeneration can increase the conditional success probability relative to sampling from scratch\. Figure[4](https://arxiv.org/html/2606.24994#S5.F4)confirms this operationally: low\-entropy selection improves over random prefix selection, and embedding\-smoothed entropy closes much of the gap to oracle selection\.

### 5\.7Ablation Studies

We conduct two additional ablations to test whether the novelty reward is robust to its coefficient and whether the form of the novelty signal matters\.

#### Sensitivity to the novelty coefficient\.

Table[5](https://arxiv.org/html/2606.24994#S5.T5)sweeps the intrinsic\-reward coefficientγ\\gammawhile keeping the regeneration mechanism fixed\. Settingγ=0\\gamma=0recovers ExTra\-R\. Moderate novelty pressure improves both pass@1 and pass@16, with the defaultγ=1\.0\\gamma=1\.0giving the best macro\-average result\. A much larger value,γ=5\.0\\gamma=5\.0, substantially hurts performance, suggesting that novelty should complement rather than dominate the extrinsic correctness reward\.

Table 5:Sensitivity to the intrinsic\-reward coefficientγ\\gamma\. Macro\-averaged pass@1 / pass@16 over the six benchmarks\.
#### Choice of novelty signal\.

Table[6](https://arxiv.org/html/2606.24994#S5.T6)compares three novelty signals on top of entropy\-guided regeneration: policy surprise, 4\-gram novelty, and the default embedding\-based novelty\. Embedding novelty performs best, especially on pass@16\. Compared with 4\-gram novelty, it is only \+1\.01\.0pp better on pass@1 but \+5\.05\.0pp better on pass@16\. This gap suggests that lexical novelty is not enough: surface\-level variation can produce paraphrased solutions that remain close in reasoning space, whereas embedding novelty better captures semantically distinct solution paths\.

Table 6:Novelty\-signal ablation\. Macro\-averaged pass@1 and pass@16\.

## 6Discussion

Trajectory\-level signals matter\.ExTra is motivated by a simple observation: GRPO often discards useful trajectory\-level information when rewards collapse to all\-correct or all\-incorrect groups\. Rather than filtering these cases away, ExTra extracts signal from them\. Correctness\-gated novelty separates diverse correct solutions on easy prompts, while entropy\-guided regeneration reuses promising partial solutions on hard prompts\. This makes exploration an explicit part of the RL objective rather than only a data selection problem\.

The design avoids degenerate exploration\.Novelty does not encourage arbitrary behavior because the intrinsic bonus is applied only to correct responses\. Regeneration alone can narrow the search, but the full method mitigates this by pairing regeneration with novelty: novelty broadens the prefix pool, and regeneration focuses sampling within that broader pool\. The ablations support this interaction\.

ExTra improves reasoning coverage\.Overall, the results suggest that ExTra improves not only final\-answer accuracy but also the coverage of correct reasoning trajectories\. Its main significance is therefore conceptual as well as empirical: generated rollouts should be treated as sources of exploration signal, not merely as samples to be scored by a binary reward\.

## 7Conclusion

ExTra is an RLVR training framework that improves exploration in GRPO\-style optimization\. It addresses two common failure modes: diversity collapse on easy prompts and gradient starvation on hard prompts\. It combines a correctness\-gated novelty reward with entropy\-guided prefix regeneration, allowing the model to use signals already present in its own rollouts\. Experiments on mathematical reasoning benchmarks show that ExTra improves both pass@1 accuracy and pass@kkcoverage over standard baselines, suggesting that RLVR can benefit from treating generated trajectories as sources of exploration signal\.

## References

- B\. Brown, J\. Juravsky, R\. Ehrlich, R\. Clark, Q\. V\. Le, C\. Ré, and A\. Mirhoseini \(2024\)Large language monkeys: scaling inference compute with repeated sampling\.arXiv preprint arXiv:2407\.21787\.Cited by:[§1](https://arxiv.org/html/2606.24994#S1.SS0.SSS0.Px1.p1.3)\.
- Y\. Burda, H\. Edwards, A\. Storkey, and O\. Klimov \(2019\)Exploration by random network distillation\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§2](https://arxiv.org/html/2606.24994#S2.SS0.SSS0.Px2.p1.1)\.
- R\. Dai, L\. Song, H\. Liu, Z\. Liang, D\. Yu, H\. Mi, Z\. Tu, R\. Liu, T\. Zheng, H\. Zhu,et al\.\(2025\)Cde: curiosity\-driven exploration for efficient reinforcement learning in large language models\.arXiv preprint arXiv:2509\.09675\.Cited by:[§2](https://arxiv.org/html/2606.24994#S2.SS0.SSS0.Px2.p1.1)\.
- DeepSeek\-AI \(2025\)DeepSeek\-R1: incentivizing reasoning capability in LLMs via reinforcement learning\.arXiv preprint arXiv:2501\.12948\.Cited by:[§1](https://arxiv.org/html/2606.24994#S1.p1.1),[§2](https://arxiv.org/html/2606.24994#S2.SS0.SSS0.Px1.p1.1)\.
- H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe \(2023\)Let’s verify step by step\.arXiv preprint arXiv:2305\.20050\.Cited by:[§2](https://arxiv.org/html/2606.24994#S2.SS0.SSS0.Px3.p1.1)\.
- L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray,et al\.\(2022\)Training language models to follow instructions with human feedback\.Advances in Neural Information Processing Systems35,pp\. 27730–27744\.Cited by:[§1](https://arxiv.org/html/2606.24994#S1.p1.1),[§2](https://arxiv.org/html/2606.24994#S2.SS0.SSS0.Px1.p1.1)\.
- D\. Pathak, P\. Agrawal, A\. A\. Efros, and T\. Darrell \(2017\)Curiosity\-driven exploration by self\-supervised prediction\.InProceedings of the 34th International Conference on Machine Learning \(ICML\),pp\. 2778–2787\.Cited by:[§2](https://arxiv.org/html/2606.24994#S2.SS0.SSS0.Px2.p1.1)\.
- J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov \(2017\)Proximal policy optimization algorithms\.arXiv preprint arXiv:1707\.06347\.Cited by:[§2](https://arxiv.org/html/2606.24994#S2.SS0.SSS0.Px1.p1.1),[§3](https://arxiv.org/html/2606.24994#S3.p3.4)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. Li, Y\. Wu,et al\.\(2024\)Deepseekmath: pushing the limits of mathematical reasoning in open language models\.arXiv preprint arXiv:2402\.03300\.Cited by:[§1](https://arxiv.org/html/2606.24994#S1.p1.1),[§2](https://arxiv.org/html/2606.24994#S2.SS0.SSS0.Px1.p1.1),[§3](https://arxiv.org/html/2606.24994#S3.p3.4),[§5](https://arxiv.org/html/2606.24994#S5.SS0.SSS0.Px3.p1.1)\.
- V\. Shrivastava, A\. H\. Awadallah, V\. Balachandran, S\. Garg, H\. Behl, and D\. Papailiopoulos \(2026\)Sample more to think less: group filtered policy optimization for concise reasoning\.InProc\. ICLR,Cited by:[§1](https://arxiv.org/html/2606.24994#S1.SS0.SSS0.Px1.p2.1)\.
- R\. S\. Sutton, D\. McAllester, S\. Singh, and Y\. Mansour \(1999\)Policy gradient methods for reinforcement learning with function approximation\.InProc\. NeurIPS,Vol\.12\.Cited by:[§3](https://arxiv.org/html/2606.24994#S3.p2.1)\.
- P\. Wang, L\. Li, Z\. Shao, R\. Xu, D\. Dai, Y\. Li, D\. Chen, Y\. Wu, and Z\. Sui \(2024\)Math\-shepherd: verify and reinforce LLMs step\-by\-step without human annotations\.arXiv preprint arXiv:2312\.08935\.Cited by:[§2](https://arxiv.org/html/2606.24994#S2.SS0.SSS0.Px3.p1.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. Chi, Q\. V\. Le, and D\. Zhou \(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.Advances in Neural Information Processing Systems35,pp\. 24824–24837\.Cited by:[§2](https://arxiv.org/html/2606.24994#S2.SS0.SSS0.Px1.p1.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv,et al\.\(2025\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.Cited by:[§5](https://arxiv.org/html/2606.24994#S5.SS0.SSS0.Px1.p1.3)\.
- S\. Yao, D\. Yu, J\. Zhao, I\. Shafran, T\. L\. Griffiths, Y\. Cao, and K\. Narasimhan \(2023\)Tree of thoughts: deliberate problem solving with large language models\.Advances in Neural Information Processing Systems36\.Cited by:[§2](https://arxiv.org/html/2606.24994#S2.SS0.SSS0.Px4.p1.1)\.
- Q\. Yu, Z\. Zhang, R\. Zhu, Y\. Yuan, X\. Zuo, Y\. Yue, W\. Dai, T\. Fan, G\. Liu, L\. Liu,et al\.\(2025\)Dapo: an open\-source llm reinforcement learning system at scale\.arXiv preprint arXiv:2503\.14476\.Cited by:[§1](https://arxiv.org/html/2606.24994#S1.SS0.SSS0.Px1.p2.1),[§2](https://arxiv.org/html/2606.24994#S2.SS0.SSS0.Px1.p1.1),[§5](https://arxiv.org/html/2606.24994#S5.SS0.SSS0.Px1.p1.3),[§5](https://arxiv.org/html/2606.24994#S5.SS0.SSS0.Px3.p1.1)\.
- Y\. Zhu, S\. Lu, L\. Zheng, J\. Guo, W\. Zhang, J\. Wang, and Y\. Yu \(2018\)Texygen: a benchmarking platform for text generation models\.InThe 41st international ACM SIGIR conference on research & development in information retrieval,pp\. 1097–1100\.Cited by:[§5\.4](https://arxiv.org/html/2606.24994#S5.SS4.p1.5)\.

## Appendix AFull ExTra Algorithm

Algorithm[1](https://arxiv.org/html/2606.24994#alg1)gives the full training procedure for ExTra\. The algorithm has two additions to standard GRPO: a post\-normalization novelty bonus for correct trajectories and a prefix\-regeneration queue for hard prompts\.

Algorithm 1ExTra: Exploratory Trajectory Optimization0:Policy

πθ\\pi\_\{\\theta\}, prompt set

𝒟\\mathcal\{D\}, embedding model

hh, correct\-response memory

ℳr\\mathcal\{M\}\_\{\\mathrm\{r\}\}, prefix memory

ℳp\\mathcal\{M\}\_\{\\mathrm\{p\}\}, guided\-prompt queue

𝒬\\mathcal\{Q\}, novelty weight

γ\\gamma, smoothing temperature

τ\\tau, warmup length

TwarmT\_\{\\mathrm\{warm\}\}
1:fortraining step

t=1,2,…t=1,2,\\ldotsdo

2:Sample prompts

X∼𝒟X\\sim\\mathcal\{D\}
3:if

𝒬\\mathcal\{Q\}is not emptythen

4:Replace the tail of

XXwith guided prompts dequeued from

𝒬\\mathcal\{Q\}
5:endif

6:Generate

GGresponses per prompt using

πθ\\pi\_\{\\theta\}
7:Compute extrinsic rewards

RextR\_\{\\mathrm\{ext\}\}and token\-level entropies

8:Embed all responses with

hh
9:foreach prompt

xxwith responses

\{yi\}i=1G\\\{y\_\{i\}\\\}\_\{i=1\}^\{G\}do

10:Let

zi=h​\(yi\)z\_\{i\}=h\(y\_\{i\}\)
11:Let

𝒞i​\(x\)\\mathcal\{C\}\_\{i\}\(x\)be the set of embeddings in

ℳr​\(x\)\\mathcal\{M\}\_\{\\mathrm\{r\}\}\(x\)plus correct sibling\-response embeddings from the current group, excluding

ziz\_\{i\}
12:Compute correctness\-gated novelty using cosine similarity:

rint\(i\)=𝟙\[Rext\(i\)\>0\]​\(1−maxz′∈𝒞i​\(x\)⁡cos⁡\(zi,z′\)\),max⁡∅=0r\_\{\\mathrm\{int\}\}^\{\(i\)\}=\\mathds\{1\}\_\{\[R\_\{\\mathrm\{ext\}\}^\{\(i\)\}\>0\]\}\\left\(1\-\\max\_\{z^\{\\prime\}\\in\\mathcal\{C\}\_\{i\}\(x\)\}\\cos\(z\_\{i\},z^\{\\prime\}\)\\right\),\\qquad\\max\\varnothing=0
13:Update

ℳr​\(x\)\\mathcal\{M\}\_\{\\mathrm\{r\}\}\(x\)with the current correct\-response embeddings, keeping the most recent memory entries

14:endfor

15:Compute GRPO advantages

A^iGRPO=\(Rext\(i\)−μgroup\)/σgroup\\hat\{A\}\_\{i\}^\{\\mathrm\{GRPO\}\}=\(R\_\{\\mathrm\{ext\}\}^\{\(i\)\}\-\\mu\_\{\\mathrm\{group\}\}\)/\\sigma\_\{\\mathrm\{group\}\}, setting them to zero when

σgroup=0\\sigma\_\{\\mathrm\{group\}\}=0
16:Add novelty after normalization:

A^i=A^iGRPO\+γ​rint\(i\)\\hat\{A\}\_\{i\}=\\hat\{A\}\_\{i\}^\{\\mathrm\{GRPO\}\}\+\\gamma r\_\{\\mathrm\{int\}\}^\{\(i\)\}
17:Update

πθ\\pi\_\{\\theta\}with the clipped GRPO objective using

A^i\\hat\{A\}\_\{i\}
18:if

t\>Twarmt\>T\_\{\\mathrm\{warm\}\}then

19:foreach prompt

xxwith group pass rate

0do

20:Extract reasoning prefixes

\{pj\}\\\{p\_\{j\}\\\}from the generated responses

21:Compute each prefix’s Mean Token Entropy

ℋj\\mathcal\{H\}\_\{j\}and embedding

eje\_\{j\}
22:Store

\(pj,ℋj,ej\)\(p\_\{j\},\\mathcal\{H\}\_\{j\},e\_\{j\}\)in

ℳp​\(x\)\\mathcal\{M\}\_\{\\mathrm\{p\}\}\(x\), keeping at most the configured prefix\-memory budget

23:Smooth entropy scores over the current and cached prefixes by embedding similarity as in Eq\.[4\.3](https://arxiv.org/html/2606.24994#S4.Ex4)

24:Select

p∗=arg⁡minj⁡ℋ~jp^\{\*\}=\\arg\\min\_\{j\}\\tilde\{\\mathcal\{H\}\}\_\{j\}
25:Enqueue the guided prompt

\[x,p∗\]\[x,p^\{\*\}\]into

𝒬\\mathcal\{Q\}
26:endfor

27:endif

28:endfor

29:returnupdated policy

πθ\\pi\_\{\\theta\}

## Appendix BImplementation and Experimental Details

### B\.1GRPO Objective

For completeness, we write the GRPO objective used in our implementation\. For each prompt𝐱\\mathbf\{x\}, a group ofGGresponses\{𝐲i\}i=1G\\\{\\mathbf\{y\}\_\{i\}\\\}\_\{i=1\}^\{G\}is sampled from the old policyπθold\\pi\_\{\\theta\_\{\\mathrm\{old\}\}\}\. The policy is updated by maximizing

𝒥GRPO​\(θ\)=𝔼​\[1G​∑i=1Gmin⁡\(ρi​\(θ\)​A^i,clip​\(ρi​\(θ\),1−ϵ,1\+ϵ\)​A^i\)−β​𝔻KL​\(πθ∥πref\)\],\\mathcal\{J\}\_\{\\mathrm\{GRPO\}\}\(\\theta\)=\\mathbb\{E\}\\left\[\\frac\{1\}\{G\}\\sum\_\{i=1\}^\{G\}\\min\\\!\\left\(\\rho\_\{i\}\(\\theta\)\\hat\{A\}\_\{i\},\\mathrm\{clip\}\\bigl\(\\rho\_\{i\}\(\\theta\),1\-\\epsilon,1\+\\epsilon\\bigr\)\\hat\{A\}\_\{i\}\\right\)\-\\beta\\,\\mathbb\{D\}\_\{\\mathrm\{KL\}\}\(\\pi\_\{\\theta\}\\\|\\pi\_\{\\mathrm\{ref\}\}\)\\right\],whereρi​\(θ\)=πθ​\(𝐲i\|𝐱\)/πθold​\(𝐲i\|𝐱\)\\rho\_\{i\}\(\\theta\)=\\pi\_\{\\theta\}\(\\mathbf\{y\}\_\{i\}\|\\mathbf\{x\}\)/\\pi\_\{\\theta\_\{\\mathrm\{old\}\}\}\(\\mathbf\{y\}\_\{i\}\|\\mathbf\{x\}\)is the importance ratio,ϵ\\epsilonis the clipping parameter, andβ\\betacontrols the KL penalty\.

### B\.2Training Configuration

We follow the configuration described in Section[5\.1](https://arxiv.org/html/2606.24994#S5.SS1); full hyperparameters are listed in Table[7](https://arxiv.org/html/2606.24994#A2.T7)\. The novelty bonus is added after GRPO group normalization, as described in Eq\.[2](https://arxiv.org/html/2606.24994#S4.E2)\. This preserves the reward normalization used by GRPO and prevents the novelty score from changing the group mean or variance\.

### B\.3Novelty Memory and Embedding Details

For each prompt, the correct\-response memory stores embeddings of recent verified\-correct responses\. Embeddings are normalized before cosine similarity is computed\. In our runs, the relevant cosine similarities are empirically nonnegative; the theoretical analysis states this as an assumption, so the intrinsic bonus lies in\[0,1\]\[0,1\]\. Unless otherwise stated, the memory keeps theG=6G=6most recent correct responses per prompt, matching Tab\.[7](https://arxiv.org/html/2606.24994#A2.T7)\.

### B\.4Guided Prompts and Loss Masking

Prefix regeneration constructs a guided prompt by concatenating the original problem statement with the selected prefix\. The regenerated sample consists only of the newly generated suffix\. For reward computation, we verify the full concatenated trajectory consisting of the fixed prefix and the generated suffix\. For the policy update, the fixed prefix is treated as context and the GRPO loss is applied only to the suffix tokens generated by the current policy\. This masking avoids training the model to imitate cached prefixes while still allowing the prefix to guide exploration toward useful intermediate states\.

### B\.5Prefix Queue and Memory

Regeneration is disabled during the firstTwarm=30T\_\{\\mathrm\{warm\}\}=30steps\. After warmup, only all\-incorrect groups are eligible for prefix regeneration\. For each eligible prompt, we cache extracted prefixes with their MTE values and embeddings, retaining at most 128 prefixes per prompt\. Entropy smoothing is performed within the same prompt over the current prefixes and the cached prefix memory\. Guided prompts are placed in a FIFO queue with maximum size 4,096 and replace the tail of a later training batch\.

### B\.6Evaluation Configuration

Evaluation uses temperature0\.70\.7, top\-p=0\.9p=0\.9, andn=16n=16samples per problem\. We use a maximum generation budget of 31,744 tokens per response to avoid truncating long reasoning chains\. pass@1 is estimated as the mean single\-sample accuracy over the sampled responses\. pass@16 is the fraction of problems for which at least one of the 16 sampled solutions is correct\. We use the same answer\-extraction and verification pipeline for all methods\.

Table 7:Hyperparameters used in the main experiments\.

## Appendix CPrefix\-Value Evaluation Details

### C\.1Monte Carlo Prefix Evaluation Protocol

Section[5\.6](https://arxiv.org/html/2606.24994#S5.SS6)evaluates whether Mean Token Entropy \(MTE\) can identify promising intermediate reasoning states\. The experiment estimates the value of a prefix by sampling continuations from it and measuring the resulting pass rate\.

- •Problem selection:22 MATH\-500 problems at difficulty levels 4–5 where Qwen3\-1\.7B produces at least one correct and one incorrect full trajectory\.
- •Initial trajectories:32 full trajectories per problem, sampled at temperature 1\.0\.
- •Prefix extraction:Each trajectory is truncated at 20%, 40%, 60%, and 80% of its reasoning\-step count\.
- •Continuations:For each prefix, we sampleK=16K=16continuations using temperature0\.70\.7and top\-p=0\.9p=0\.9\.
- •Total data:After parsing, filtering, and per\-problem balancing, 600 evaluated prefix–value pairs, corresponding to 9,600 sampled continuations in total\.
- •Correlation:We report within\-problem Spearman correlation, averaged using Fisherzz\-transformation, and pooled Spearman correlation over all 600 prefixes\.

Within\-problem correlation is the primary statistic because regeneration selects among prefixes belonging to the same prompt\. Pooled correlation is reported only as a secondary diagnostic because it is confounded by problem\-level difficulty\.

### C\.2Full Prefix Metric Correlation Table

Table[8](https://arxiv.org/html/2606.24994#A3.T8)reports all 21 evaluated prefix metrics\. Metrics are grouped by whether they require access to the ground\-truth answer\. MTE is the strongest within\-problem predictor even though it is unsupervised and computed from rollout logits\.

Table 8:Spearman correlation between prefix metrics and Monte Carlo estimated pass rate over 600 prefixes from 22 problems\.

## Appendix DAdditional Analysis

### D\.1Post\-Normalization Novelty Application

A key design choice in ExTra is to add the novelty reward after GRPO’s group\-wise normalization rather than adding it to the raw reward before normalization\. Throughout this section, we assume the empirical regime observed in our experiments: cosine similarities used in Eq\.[1](https://arxiv.org/html/2606.24994#S4.E1)lie in\[0,1\]\[0,1\], and therefore the correctness\-gated novelty bonusbib\_\{i\}lies in\[0,1\]\[0,1\]\. LetRi∈\{0,1\}R\_\{i\}\\in\\\{0,1\\\}be the verifiable reward\. If novelty is added before normalization, the advantage becomes

A^ipre=\(Ri\+γ​bi\)−1G​∑j\(Rj\+γ​bj\)Stdj⁡\(Rj\+γ​bj\)\.\\hat\{A\}\_\{i\}^\{\\mathrm\{pre\}\}=\\frac\{\(R\_\{i\}\+\\gamma b\_\{i\}\)\-\\frac\{1\}\{G\}\\sum\_\{j\}\(R\_\{j\}\+\\gamma b\_\{j\}\)\}\{\\operatorname\{Std\}\_\{j\}\(R\_\{j\}\+\\gamma b\_\{j\}\)\}\.This couples novelty to the normalization statistics\. A shared novelty offset is removed by mean subtraction, while variation inbib\_\{i\}changes the denominator and can shrink or distort the relative contribution of the binary task reward\.

Post\-normalization application avoids this coupling:

A^ipost=Ri−μRσR\+γ​bi,\\hat\{A\}\_\{i\}^\{\\mathrm\{post\}\}=\\frac\{R\_\{i\}\-\\mu\_\{R\}\}\{\\sigma\_\{R\}\}\+\\gamma b\_\{i\},with the zero\-variance convention used for the GRPO term\. The standard reward normalization is preserved, andγ\\gammahas a stable interpretation: it is the additional advantage, in normalized units, assigned to a maximally novel correct solution relative to a minimally novel one\. Since novelty does not enter the group mean or variance, it supplements the task reward instead of changing how the task reward is standardized\.

### D\.2Computational Overhead

ExTra adds computation in two places: prefix extraction for regeneration and response embedding for semantic novelty\. Table[9](https://arxiv.org/html/2606.24994#A4.T9)reports wall\-clock overhead per step over 100 training steps\.

Table 9:Per\-step wall\-clock overhead relative to GRPO, reported as mean±\\pmstandard deviation over 100 training steps\. The exploration phase is unique to ExTra and includes prefix memory updates and novelty computation\.Embedding novelty gives the best pass@16 performance, but it is also the most expensive option\. N\-gram novelty roughly halves the overhead, but as shown in Table[6](https://arxiv.org/html/2606.24994#S5.T6), it loses substantial inference\-time coverage\. We therefore use embedding novelty as the default and treat lexical novelty as a lower\-cost alternative when throughput is the primary constraint\.

### D\.3Scope and Composition with Other Methods

Our experiments focus on mathematical reasoning with 1–2B\-scale models\. The entropy heuristic is likely to transfer best to domains where correct intermediate reasoning reduces token\-level uncertainty\. Code generation may share this property, while factual QA may be more challenging because fluent but incorrect chains can still have low entropy\.

ExTra is orthogonal to batch\-filtering methods such as DAPO\. DAPO changes which groups contribute to the gradient update; ExTra changes how trajectories are rewarded and regenerated within the batch\. A natural extension is to combine the two: filtering can remove low\-value updates, while ExTra can broaden the solution distribution inside the retained groups\.

## Appendix ETheoretical Details

This section gives local arguments supporting the design of ExTra\. The claims are intentionally modest: they do not prove convergence of LLM RL, but they isolate why the proposed mechanisms are useful under binary rewards and group\-relative normalization\.

### E\.1Homogeneous GRPO Degeneracy

#### Lemma A\.1\.

Fix a prompt𝒙\{\\bm\{x\}\}and a GRPO group\{𝒚i\}i=1G\\\{\{\\bm\{y\}\}\_\{i\}\\\}\_\{i=1\}^\{G\}with scalar rewardsri=cr\_\{i\}=cfor everyii\. Let

ℒgrp​\(θ\)=1G​∑i=1Gmin⁡\(ρi​\(θ\)​A^i,clip​\(ρi​\(θ\),1−ϵ,1\+ϵ\)​A^i\)\\mathcal\{L\}\_\{\\mathrm\{grp\}\}\(\\theta\)=\\frac\{1\}\{G\}\\sum\_\{i=1\}^\{G\}\\min\\\!\\left\(\\rho\_\{i\}\(\\theta\)\\hat\{A\}\_\{i\},\\mathrm\{clip\}\(\\rho\_\{i\}\(\\theta\),1\-\\epsilon,1\+\\epsilon\)\\hat\{A\}\_\{i\}\\right\)be the reward\-driven part of the clipped GRPO surrogate\. Standard GRPO setsA^i=0\\hat\{A\}\_\{i\}=0whenσr=0\\sigma\_\{r\}=0to avoid division by zero; we adopt this zero\-variance convention\. Under it,A^i=0\\hat\{A\}\_\{i\}=0for allii, and thereforeℒgrp​\(θ\)=0\\mathcal\{L\}\_\{\\mathrm\{grp\}\}\(\\theta\)=0and∇θℒgrp​\(θ\)=0\\nabla\_\{\\theta\}\\mathcal\{L\}\_\{\\mathrm\{grp\}\}\(\\theta\)=0for everyθ\\theta\. If a KL penalty is used, only that separate regularizer remains\.

#### Proof\.

Since all rewards equalcc, their group mean isμr=c\\mu\_\{r\}=cand every centered reward is zero\. The zero\-variance convention sets all standardized advantages to zero\. SubstitutingA^i=0\\hat\{A\}\_\{i\}=0into the surrogate, bothρi​\(θ\)​A^i=0\\rho\_\{i\}\(\\theta\)\\hat\{A\}\_\{i\}=0andclip​\(ρi​\(θ\),1−ϵ,1\+ϵ\)​A^i=0\\mathrm\{clip\}\(\\rho\_\{i\}\(\\theta\),1\-\\epsilon,1\+\\epsilon\)\\hat\{A\}\_\{i\}=0, so every minimum evaluates to zero\. Hence the reward\-driven surrogate is constant inθ\\thetaand has zero gradient\.□\\square

### E\.2Sign Correctness of Post\-Normalization Novelty

With binary rewardsRi∈\{0,1\}R\_\{i\}\\in\\\{0,1\\\}, novelty bonusbi∈\[0,1\]b\_\{i\}\\in\[0,1\], and correctness gatebi=0b\_\{i\}=0wheneverRi=0R\_\{i\}=0, ExTra uses

A^ifinal=A^iGRPO\+γ​bi,γ≥0\.\\hat\{A\}\_\{i\}^\{\\mathrm\{final\}\}=\\hat\{A\}\_\{i\}^\{\\mathrm\{GRPO\}\}\+\\gamma b\_\{i\},\\qquad\\gamma\\geq 0\.This construction has three useful properties\. First, in any mixed group, standardization preserves the strict ordering between correct and incorrect responses, and the correctness gate ensures that incorrect responses receive no novelty bonus\. Therefore a correct response cannot be ranked below an incorrect one because of novelty\. Second, in an all\-correct group, the GRPO advantage is zero by Lemma A\.1, so the novelty term becomes the only source of within\-group advantage\. Third, in an all\-incorrect group, both the GRPO advantage and the gated novelty bonus are zero, so ExTra does not reward diverse failures\.

This is a sanity check on the construction rather than a formal diversity guarantee; the diversity effect is evaluated empirically in Sec\.[5\.4](https://arxiv.org/html/2606.24994#S5.SS4)and Sec\.[5\.7](https://arxiv.org/html/2606.24994#S5.SS7)\.

### E\.3Novelty as Correct\-Mode Coverage

#### Proposition A\.2\.

Fix a prompt𝒙\{\\bm\{x\}\}\. Suppose correct trajectories partition into latent semantic modes𝒞1,…,𝒞M\\mathcal\{C\}\_\{1\},\\ldots,\\mathcal\{C\}\_\{M\}, and supposeΦ​\(𝒙\)\\Phi\(\{\\bm\{x\}\}\)contains one representative embedding for each already discovered correct mode and no representatives of undiscovered modes\. Assume ideal separation under cosine similarity: two correct trajectories in the same mode have similarity11, while trajectories in different modes have similarity0\. Let

C​\(Φ\)=\|\{m:Φ​\(𝒙\)​contains a representative from​𝒞m\}\|C\(\\Phi\)=\\left\|\\\{m:\\Phi\(\{\\bm\{x\}\}\)\\text\{ contains a representative from \}\\mathcal\{C\}\_\{m\}\\\}\\right\|be the number of covered correct modes, and letΦ\+\\Phi^\{\+\}be the memory after adding a newly sampled trajectory𝒚\{\\bm\{y\}\}if it is correct\. Hererintrawr\_\{\\mathrm\{int\}\}^\{\\mathrm\{raw\}\}denotes the*pre\-normalization*intrinsic reward of Eq\.[1](https://arxiv.org/html/2606.24994#S4.E1), distinct from the post\-normalization advantage contributionγ​rint\(i\)\\gamma\\,r\_\{\\mathrm\{int\}\}^\{\(i\)\}used in the algorithm\. Withmax⁡∅=0\\max\\varnothing=0,

rintraw​\(𝒙,𝒚;Φ\)=𝟙\[R​\(𝒙,𝒚\)\>0\]​\(1−max𝒛′∈Φ​\(𝒙\)⁡cos⁡\(h​\(𝒚\),𝒛′\)\)=C​\(Φ\+\)−C​\(Φ\)\.r\_\{\\mathrm\{int\}\}^\{\\mathrm\{raw\}\}\(\{\\bm\{x\}\},\{\\bm\{y\}\};\\Phi\)=\\mathds\{1\}\_\{\[R\(\{\\bm\{x\}\},\{\\bm\{y\}\}\)\>0\]\}\\left\(1\-\\max\_\{\{\\bm\{z\}\}^\{\\prime\}\\in\\Phi\(\{\\bm\{x\}\}\)\}\\cos\(h\(\{\\bm\{y\}\}\),\{\\bm\{z\}\}^\{\\prime\}\)\\right\)=C\(\\Phi^\{\+\}\)\-C\(\\Phi\)\.

#### Proof\.

If𝒚\{\\bm\{y\}\}is incorrect, the correctness gate makes the intrinsic reward zero and the number of covered correct modes does not change\. If𝒚\{\\bm\{y\}\}is correct and belongs to an already covered mode, the maximum cosine similarity to the stored representative of that mode is11, so the intrinsic reward is zero and coverage does not increase\. If𝒚\{\\bm\{y\}\}is correct and belongs to an uncovered mode, all stored representatives are from different modes and have cosine similarity0; the intrinsic reward is11, and adding𝒚\{\\bm\{y\}\}increases the coverage count by exactly one\.□\\square

### E\.4Regeneration Efficiency

#### Proof of Proposition[1](https://arxiv.org/html/2606.24994#Thmproposition1)\(restated\)\.

We first restate the claim for a fixed policy: withp0p\_\{0\}the scratch\-sampling success probability andq​\(𝒑\)q\(\{\\bm\{p\}\}\)the conditional success probability of the suffix given prefix𝒑\{\\bm\{p\}\}, at least one length\-ttprefix in the policy support satisfiesq​\(𝒑\)≥p0q\(\{\\bm\{p\}\}\)\\geq p\_\{0\}; for any𝒑∗\{\\bm\{p\}\}^\{\*\}withq​\(𝒑∗\)\>p0q\(\{\\bm\{p\}\}^\{\*\}\)\>p\_\{0\},Pregen​\(K\)\>Pscratch​\(K\)P\_\{\\mathrm\{regen\}\}\(K\)\>P\_\{\\mathrm\{scratch\}\}\(K\)for everyK≥1K\\geq 1, and the expected number of attempts to obtain a correct trajectory falls from1/p01/p\_\{0\}to1/q​\(𝒑∗\)1/q\(\{\\bm\{p\}\}^\{\*\}\)\.

By the law of total probability, conditioning on the length\-ttprefix of a scratch rollout gives

p0=∑𝒑∈𝒱tPrπθ⁡\[𝒚≤t=𝒑∣𝒙\]​q​\(𝒑\)=𝔼𝒑∼πθ\(⋅\|𝒙\)​\[q​\(𝒑\)\]\.p\_\{0\}=\\sum\_\{\{\\bm\{p\}\}\\in\\mathcal\{V\}^\{t\}\}\\Pr\_\{\\pi\_\{\\theta\}\}\[\{\\bm\{y\}\}\_\{\\leq t\}=\{\\bm\{p\}\}\\mid\{\\bm\{x\}\}\]q\(\{\\bm\{p\}\}\)=\\mathbb\{E\}\_\{\{\\bm\{p\}\}\\sim\\pi\_\{\\theta\}\(\\cdot\|\{\\bm\{x\}\}\)\}\[q\(\{\\bm\{p\}\}\)\]\.Thusp0p\_\{0\}is the prefix\-weighted mean ofq​\(⋅\)q\(\\cdot\)\. Therefore at least one prefix satisfiesq​\(𝒑\)≥p0q\(\{\\bm\{p\}\}\)\\geq p\_\{0\}, with strict inequality unlessq​\(⋅\)q\(\\cdot\)is constant over the support\.

Now fix a selected prefix𝒑∗\{\\bm\{p\}\}^\{\*\}withq​\(𝒑∗\)\>p0q\(\{\\bm\{p\}\}^\{\*\}\)\>p\_\{0\}\. For any strategy with per\-attempt success probabilitypp, the probability that allKKindependent attempts fail is\(1−p\)K\(1\-p\)^\{K\}, so the probability of at least one success is1−\(1−p\)K1\-\(1\-p\)^\{K\}\. Substitutingp=p0p=p\_\{0\}andp=q​\(𝒑∗\)p=q\(\{\\bm\{p\}\}^\{\*\}\)gives

Pregen​\(K\)=1−\(1−q​\(𝒑∗\)\)K\>1−\(1−p0\)K=Pscratch​\(K\)P\_\{\\mathrm\{regen\}\}\(K\)=1\-\(1\-q\(\{\\bm\{p\}\}^\{\*\}\)\)^\{K\}\>1\-\(1\-p\_\{0\}\)^\{K\}=P\_\{\\mathrm\{scratch\}\}\(K\)for everyK≥1K\\geq 1\. The expected waiting time until the first success in an independent Bernoulli sequence with success probabilityp\>0p\>0is1/p1/p, so regeneration reduces the expected number of attempts from1/p01/p\_\{0\}to1/q​\(𝒑∗\)1/q\(\{\\bm\{p\}\}^\{\*\}\)\. Finally, fixing the firsttttokens reduces the reachable trajectory space from𝒱L\\mathcal\{V\}^\{L\}to𝒱L−t\\mathcal\{V\}^\{L\-t\}because only the suffix remains to be sampled\. This last statement is a combinatorial search\-space comparison; the probability of success is governed by the conditional success rateq​\(𝒑∗\)q\(\{\\bm\{p\}\}^\{\*\}\)\.□\\square

Similar Articles

Read the Trace, Steer the Path: Trajectory-Aware Reinforcement Learning for Diffusion Language Models

arXiv cs.CL

This paper introduces CAPR (Cached-Amortized Path Refinement), a reinforcement learning algorithm for diffusion large language models that extracts tree-like supervision signals from the denoising trace without the compute cost of full tree rollouts. CAPR achieves state-of-the-art performance on reasoning benchmarks like GSM8K, Math500, Sudoku, and Countdown at roughly 0.75x the cost of flat rollouts.