Shattering the Autoregressive Curse: Dynamic Epistemic Entropy Orchestrated Erasable Reinforcement Learning for LLMs

arXiv cs.AI Papers

Summary

This paper proposes E³RL, a reinforcement learning method that uses dynamic epistemic entropy thresholds to enable LLMs to excise local logical defects during generation, overcoming the autoregressive curse in long-horizon reasoning and achieving state-of-the-art results on mathematical reasoning benchmarks like AIME.

arXiv:2606.17735v1 Announce Type: new Abstract: Although reinforcement learning (RL) has expanded the cognitive boundaries of large language models (LLMs), it often remains vulnerable to the autoregressive curse in long-horizon logical reasoning: small epistemic perturbations introduced early in generation can propagate irreversibly along the Markov decision process flow, triggering cascading failures that drive the reasoning trajectory toward collapse. To overcome this autoregressive cascade, in which a single early mistake can compromise all subsequent reasoning steps, we propose dynamic epistemic entropy orchestrated erasable reinforcement learning ($\text{E}^3\text{RL}$). $\text{E}^3\text{RL}$ eliminates reliance on external signals by grounding the model's endogenous local autoregressive cross-entropy as an intrinsic coordinate of epistemic uncertainty. By introducing segment-level adaptive dynamic thresholds and advantage allocation, $\text{E}^3\text{RL}$ enables the model to precisely excise localized logical defects while reusing historical key-value (KV) cache streams, thereby endowing the reasoning process with a self-healing capability. We train $\text{E}^3\text{RL}$ on the DeepMath-103k dataset. Experimental results show that $\text{E}^3\text{RL}$ reshapes the exploration efficiency of long-sequence reasoning and improves sample efficiency while maintaining linear memory overhead. On mathematical reasoning benchmarks such as AIME, $\text{E}^3\text{RL}$ achieves substantial performance gains, with the 4B and 8B parameter models surpassing previous state-of-the-art (SOTA) results by 5.349\% and 6.514\%, respectively. These findings suggest that $\text{E}^3\text{RL}$ shatters the autoregressive curse in long-sequence reasoning and establishes a theoretical and systems-level foundation for the next generation of self-healing artificial general intelligence (AGI).

Cached at: 06/17/26, 05:38 AM

# Shattering the Autoregressive Curse: Dynamic Epistemic Entropy Orchestrated Erasable Reinforcement Learning for LLMs
Source: [https://arxiv.org/html/2606.17735](https://arxiv.org/html/2606.17735)
Ziliang Wang¹∗‡, Kang An¹,²∗†, Faqiang Qian¹∗, Jialu Cai¹, Cijun Ouyang¹, Yuhang Wang¹‡, Qibing Ren²§, Yichao Wu¹§

¹SenseTime, ²Shanghai Jiao Tong University

{wangziliang1, wangyuhang, wuyichao}@sensetime.com, ankang@gml.ac.cn, renqibing@sjtu.edu.cn

###### Abstract

Although reinforcement learning (RL) has expanded the cognitive boundaries of large language models (LLMs), it often remains vulnerable to the autoregressive curse in long-horizon logical reasoning: small epistemic perturbations introduced early in generation can propagate irreversibly along the Markov decision process flow, triggering cascading failures that drive the reasoning trajectory toward collapse. To overcome this autoregressive cascade, in which a single early mistake can compromise all subsequent reasoning steps, we propose dynamic epistemic entropy orchestrated erasable reinforcement learning ($\text{E}^3\text{RL}$). $\text{E}^3\text{RL}$ eliminates reliance on external signals by grounding the model's endogenous local autoregressive cross-entropy as an intrinsic coordinate of epistemic uncertainty. By introducing segment-level adaptive dynamic thresholds and advantage allocation, $\text{E}^3\text{RL}$ enables the model to precisely excise localized logical defects while reusing historical key-value (KV) cache streams, thereby endowing the reasoning process with a self-healing capability. We train $\text{E}^3\text{RL}$ on the DeepMath-103k dataset. Experimental results show that $\text{E}^3\text{RL}$ reshapes the exploration efficiency of long-sequence reasoning and improves sample efficiency while maintaining linear memory overhead. On mathematical reasoning benchmarks such as AIME, $\text{E}^3\text{RL}$ achieves substantial performance gains, with the 4B and 8B parameter models surpassing previous state-of-the-art (SOTA) results by 5.349% and 6.514%, respectively. These findings suggest that $\text{E}^3\text{RL}$ shatters the autoregressive curse in long-sequence reasoning and establishes a theoretical and systems-level foundation for the next generation of self-healing artificial general intelligence (AGI).

![Refer to caption](https://arxiv.org/html/2606.17735v1/x1.png)

Figure 1: Performance of different RL strategies.

## 1 Introduction

Propelled by the reinforcement learning (liu2025reinforcement; zhang2025survey) driven autoregressive generation paradigm (wang2026multimodal), large language models (LLMs) cast complex problem-solving as the prediction of highly coherent token sequences, achieving a substantial leap in cognitive reasoning capabilities (zhang2026consistent; wan2026srpo; cheng2026revisiting). However, as reasoning trajectories extend into deep and long contexts (wang2025offline11; chen2025reinforcement11), this unidirectional generation paradigm (hou2025thinkprune11), which strictly follows the temporal causal arrow (chen2026learning), exposes a critical structural defect. We refer to this defect as the autoregressive curse. Without rollback or local error-correction operators, early-stage epistemic perturbations (song2026large) are unconditionally accepted and exponentially amplified along the Markov decision flow (zhu2026edis). This amplification can ultimately drive the entire high-dimensional reasoning trajectory into a catastrophic cascading collapse (shen2026reasoning; gao2026unlocking).

Both academia (zhu2024yulan; yang2026incoder) and industry (team2026kimi; yang2026iquest) have long pursued a costly approach: using external systems to patch the foundational unidirectional generation defect of LLMs (zheng2025stepsearch11; wang2025erase11). Introducing an external process reward model (PRM) (lightman2024let; zheng2025survey; pronesti2026beyond) for step-by-step scoring is limited by exorbitant data annotation costs. Moreover, a static evaluation network faces severe distribution shift as the state space of the policy model evolves, which can trigger system-level reward hacking (tiwari2026reward; wang2026reward11). Alternatively, global resampling methods (kobayashi2026flexible) must wait until the full long sequence has been generated before they can discriminate and reject (dalal2026more). This approach discards entire sequences, including their many correct prefixes, solely because of isolated local computational deviations (yu2026curiosity). Such indiscriminate flushing disrupts the fine-grained credit assignment mechanisms (guo2026segment; fang2026proximity; yang2026int) of reinforcement learning and sharply inflates computational and memory overhead (kim2026spend; zhang2026resource), rendering efficient scaling on hundred-billion-parameter models infeasible.

To shatter the autoregressive curse, we propose dynamic epistemic entropy orchestrated erasable reinforcement learning ($\text{E}^3\text{RL}$). By introducing a non-Markovian erasure operator, this method restructures the gradient flow pathways of reinforcement learning. $\text{E}^3\text{RL}$ eliminates external dependencies by using the model's endogenous local autoregressive cross-entropy (cui2025entropy; wang2026beyond) as a coordinate that explicitly represents epistemic uncertainty. By introducing segment-level adaptive dynamic thresholds and advantage allocation, $\text{E}^3\text{RL}$ enables the model to precisely excise localized logical defects while reusing historical key-value (KV) cache streams (liu2026adaptive; chen2026arborkv). In this way, it prevents defective logic from propagating through the reasoning chain and avoids inference deadlocks. Trained on the DeepMath-103k (he2025deepmath) dataset, $\text{E}^3\text{RL}$ improves both the exploration efficiency of long-sequence reasoning and sample efficiency. While maintaining a linear memory footprint, the proposed method achieves substantial performance improvements across competitive benchmarks including AMC 2023 (amc2023_12a_aops; amc2023_12b_aops), AIME 2024 (aime2024_i_aops; aime2024_ii_aops), AIME 2025 (aime2025_i_aops; aime2025_ii_aops), AIME 2026 (aime2026_i_aops; aime2026_ii_aops), MATH 500 (lightman2023let), Minerva (lewkowycz2022solving), and OlympiadBench (he2024olympiadbench). Notably, models at the 4B and 8B parameter scales (yang2025qwen3) outperform previous state-of-the-art results by 5.349% and 6.514%, respectively. These findings indicate that $\text{E}^3\text{RL}$ shatters the autoregressive curse inherent in long-sequence reasoning, thereby establishing a theoretical and systemic cornerstone for next-generation artificial general intelligence (AGI).

## 2 Preliminary

Let $\mathcal{V}$ denote a finite discrete vocabulary with cardinality $|\mathcal{V}| = V$. Given an input prompt $x \in \mathcal{V}^{*}$, an autoregressive language model $p_{\theta}$ maps the prompt to an output sequence $y = (y_1, y_2, \ldots, y_T) \in \mathcal{V}^{T}$, where $T \in \mathbb{N}^{+}$ denotes the sequence length and $\theta \in \Theta \subseteq \mathbb{R}^{d_{\theta}}$ denotes the model parameters (ji2026survey). The conditional distribution over the output sequence follows the standard autoregressive factorization (jafari2025closed):

$$
p_{\theta}(y \mid x) = \prod_{t=1}^{T} p_{\theta}(y_t \mid y_{<t}, x), \qquad y_{<t} := (y_1, \ldots, y_{t-1}). \tag{1}
$$

At each generation step $t$, the conditional distribution is obtained from the hidden state $h_t \in \mathbb{R}^{d_h}$ via a softmax layer (zheng2025mmrpt; zhou2026look):

$$
p_{\theta}(y_t = v \mid y_{<t}, x) = \frac{\exp\left(h_t^{\top} e_v / \gamma\right)}{\sum_{v' \in \mathcal{V}} \exp\left(h_t^{\top} e_{v'} / \gamma\right)}, \qquad \forall v \in \mathcal{V}, \tag{2}
$$

where $e_v \in \mathbb{R}^{d_h}$ is the output embedding of token $v$, $\gamma > 0$ is the temperature parameter, and the hidden state is computed by a Transformer decoder (su2026learning):

$$
h_t = \mathrm{Transformer}_{\theta}(x, y_{<t}). \tag{3}
$$

Equivalently, the model can be viewed as a stochastic policy $\pi_{\theta}(\cdot \mid x, y_{<t})$ over the vocabulary (qian2025uniapl). Once a token $y_t$ is sampled or decoded, it is irrevocably appended to the prefix and becomes part of the conditioning context for all subsequent decisions (kim2026early). This causal commitment is the defining property of autoregressive generation, but it also constitutes its main structural vulnerability: an early local error cannot be rolled back by the model itself and may therefore propagate through the remaining reasoning trajectory (huang2026not).
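As a minimal sketch of Eqs. (1) and (2), the snippet below computes a temperature-scaled softmax and a sequence log-probability in plain Python. The logits stand in for the inner products $h_t^{\top} e_v$; their values, and the function names, are illustrative assumptions rather than anything taken from the paper.

```python
import math

def softmax_with_temperature(logits, gamma=1.0):
    """Eq. (2): p(v) = exp(z_v / gamma) / sum_v' exp(z_v' / gamma)."""
    if gamma <= 0:
        raise ValueError("temperature gamma must be positive")
    scaled = [z / gamma for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_log_prob(step_logits, tokens, gamma=1.0):
    """Eq. (1) in log space: log p(y|x) = sum_t log p(y_t | y_<t, x)."""
    return sum(
        math.log(softmax_with_temperature(logits, gamma)[tok])
        for logits, tok in zip(step_logits, tokens)
    )

# Toy example: 3-token vocabulary, two generation steps (hypothetical logits).
step_logits = [[2.0, 1.0, 0.0], [0.5, 0.5, 3.0]]
tokens = [0, 2]
probs = softmax_with_temperature(step_logits[0], gamma=1.0)
flat = softmax_with_temperature(step_logits[0], gamma=10.0)
log_p = sequence_log_prob(step_logits, tokens)
```

A larger $\gamma$ flattens the distribution (`flat` is closer to uniform than `probs`), which makes deviations from the greedy token more likely at every step.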

To formalize this effect, let

$$
y_t^{\star} = \arg\max_{v \in \mathcal{V}} p_{\theta}(v \mid y_{<t}, x) \tag{4}
$$

denote the greedy decoding output at step $t$, and let $y_t$ be the actually sampled token, which may differ from $y_t^{\star}$ (jin2026corefine). We define the cognitive perturbation $\epsilon_t \geq 0$ as the log-probability gap between the locally optimal token and the sampled token:

$$
\epsilon_t := \log p_{\theta}(y_t^{\star} \mid y_{<t}, x) - \log p_{\theta}(y_t \mid y_{<t}, x) \geq 0. \tag{5}
$$

When $\epsilon_t = 0$, the model selects an optimal token under its own current distribution. When $\epsilon_t > 0$, the model deviates from the optimal action, and the magnitude of $\epsilon_t$ quantifies the degree of local decision mismatch (xu2026unveiling). If such a perturbation exceeds a critical level, the subsequent generation process is forced to continue from a cognitively biased prefix (venhoff2025understanding).
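The definition in Eq. (5) can be illustrated with a short sketch. The three-token distribution below is hypothetical and the helper name is ours.

```python
import math

def cognitive_perturbation(probs, sampled_token):
    """Eq. (5): log p(y*) - log p(y_t), where y* is the greedy (argmax) token."""
    greedy_token = max(range(len(probs)), key=lambda v: probs[v])
    return math.log(probs[greedy_token]) - math.log(probs[sampled_token])

probs = [0.6, 0.3, 0.1]                          # hypothetical next-token distribution
eps_greedy = cognitive_perturbation(probs, 0)    # sampled token is the argmax
eps_second = cognitive_perturbation(probs, 1)    # log(0.6 / 0.3) = log 2
eps_tail = cognitive_perturbation(probs, 2)      # log(0.6 / 0.1) = log 6
```

The gap is zero exactly when the sampled token is a greedy token, and it grows as the sampled token moves further into the tail of the distribution.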

The harmfulness of this perturbation lies in its non-compressible cascading effect (zheng2026pilot; jin2026himac). Due to the multiplicative structure of the autoregressive chain, local log-probability gaps accumulate additively in log space and therefore induce exponential distortion in probability space (cao2026diffcot; you2025probabilistic). For a prefix of length $k$, the cumulative perturbation can be written as

$$
E_{1:k} = \sum_{t=1}^{k} \epsilon_t. \tag{6}
$$

Consequently, even a small perturbation at an early step may affect all later conditional distributions:

$$
p_{\theta}(y_{t+1} \mid y_{\leq t}, x),\; p_{\theta}(y_{t+2} \mid y_{\leq t+1}, x), \ldots, p_{\theta}(y_T \mid y_{<T}, x). \tag{7}
$$

This observation motivates a segment-level generation and correction mechanism. Instead of treating the entire output as a single irreversible decision chain (sharma2026prism; singh2026v_1), we introduce intermediate checkpoints that allow the model to monitor uncertainty, identify locally unstable reasoning segments, and erase or resample problematic segments before they contaminate the full trajectory.
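To make the accumulation in Eq. (6) concrete, the sketch below sums a constant, hypothetical per-step gap of 0.1 over 50 steps and exponentiates the result. Here $\exp(-E_{1:k})$ is the product of the per-step ratios $p_{\theta}(y_t \mid \cdot) / p_{\theta}(y_t^{\star} \mid \cdot)$, so it shows how additive gaps in log space become a multiplicative distortion in probability space.

```python
import math

def cumulative_perturbation(eps):
    """Eq. (6): running sums E_{1:k} = sum_{t<=k} eps_t for k = 1..len(eps)."""
    totals, running = [], 0.0
    for e in eps:
        running += e
        totals.append(running)
    return totals

# Hypothetical per-step gaps: the same small deviation at every step.
eps = [0.1] * 50
E = cumulative_perturbation(eps)
# Product of per-step likelihood ratios p(y_t) / p(y_t*) up to step k.
ratio = [math.exp(-e) for e in E]
```

In this toy setting the ratio after 50 steps is $e^{-5}$, below one percent, even though each individual step looks nearly optimal.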

## 3 Method

### 3.1 Segmented Generation

$\text{E}^3\text{RL}$ restructures conventional one-shot autoregressive generation into an iterative segmented generation process with checkpoints. Given an input prompt $x$ and a pair of hyperparameters $(L, N)$, the output sequence $y = (y_1, \ldots, y_T)$ is divided into $N$ non-overlapping segments. Assuming $T = N \times L$, the $n$-th segment is defined as

$$
s_n = \left(y_{(n-1)L+1}, \ldots, y_{nL}\right), \qquad n = 1, \ldots, N. \tag{8}
$$

Let $s_{<n} := (s_1, \ldots, s_{n-1})$ denote all previously generated segments. Under the segmented formulation, the probability of the $n$-th segment is

$$
p_{\theta}(s_n \mid x, s_{<n}) = \prod_{t=(n-1)L+1}^{nL} p_{\theta}(y_t \mid y_{<t}, x). \tag{9}
$$

The segmented generator decomposes the original length-$T$ irreversible decision chain into $N$ local decision windows. This design reduces the depth of single-step error propagation and introduces explicit checkpoints at which the model can evaluate whether the current segment should be accepted, erased, or regenerated.
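The segmentation in Eqs. (8) and (9) amounts to slicing the token sequence into fixed-length windows. The sketch below assumes $T = N \times L$ as in the text; the token IDs and log-probabilities are placeholders.

```python
def split_into_segments(tokens, L):
    """Eq. (8): consecutive non-overlapping segments of length L (assumes T = N * L)."""
    if L <= 0 or len(tokens) % L != 0:
        raise ValueError("sequence length must be a positive multiple of L")
    return [tokens[i:i + L] for i in range(0, len(tokens), L)]

def segment_log_probs(token_log_probs, L):
    """Eq. (9) in log space: sum the token log-probabilities inside each segment."""
    return [sum(seg) for seg in split_into_segments(token_log_probs, L)]

tokens = list(range(1, 13))                    # stand-in for y_1..y_12
segments = split_into_segments(tokens, L=4)    # N = 3 segments
log_probs = [-0.5] * 12                        # hypothetical per-token log-probabilities
seg_lp = segment_log_probs(log_probs, L=4)
```

Because the segments partition the sequence, the segment log-probabilities sum to the sequence log-probability of Eq. (1).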

![Refer to caption](https://arxiv.org/html/2606.17735v1/x2.png)

Figure 2: Overview of $\text{E}^3\text{RL}$.

### 3.2 Epistemic Entropy Monitoring

For a generation position $t$, given the hidden state $h_t$, we define the token-level cognitive entropy and its segment-level average as

$$
H_t = -\sum_{v \in \mathcal{V}} p_{\theta}(v \mid h_t) \log p_{\theta}(v \mid h_t), \quad \mathcal{H}_n = \frac{1}{L} \sum_{t=(n-1)L+1}^{nL} H_t. \tag{10}
$$
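A minimal sketch of Eq. (10), using hypothetical distributions and entropy values, shows how token entropies are computed and averaged per segment.

```python
import math

def token_entropy(probs):
    """Eq. (10), left: H_t = -sum_v p(v) log p(v), with 0 log 0 taken as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def segment_mean_entropy(token_entropies, L):
    """Eq. (10), right: average H_t over each length-L segment."""
    return [sum(token_entropies[i:i + L]) / L
            for i in range(0, len(token_entropies), L)]

uniform = token_entropy([0.25] * 4)                 # maximal for 4 tokens: log 4
peaked = token_entropy([0.97, 0.01, 0.01, 0.01])    # confident prediction
H = [0.1, 0.3, 1.2, 0.8]                            # hypothetical token entropies
seg_H = segment_mean_entropy(H, L=2)
```

A confident prediction yields entropy close to zero, while a uniform distribution over $V$ tokens attains the maximum $\log V$.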
To suppress short-term entropy fluctuations and emphasize persistent uncertainty trends, $\text{E}^3\text{RL}$ introduces sliding-window smoothing with boundary normalization:

$$
\widetilde{\mathcal{H}}_n = \frac{1}{C_n} \sum_{w=-W}^{W} \alpha^{|w|} \mathcal{H}_{n+w}, \quad C_n = \sum_{w=-W}^{W} \alpha^{|w|} \mathbf{1}[1 \leq n+w \leq N], \tag{11}
$$

where $W$ is the half-window size, $\alpha \in (0, 1]$ is a distance-decay coefficient, and $C_n$ is the normalization factor near segment boundaries. Terms whose index $n+w$ falls outside $[1, N]$ are omitted from the sum, which is what the indicator in $C_n$ accounts for. In addition to the average entropy, we extract the maximum entropy within the segment and monitor the intra-segment entropy variation rate to capture local burst-like cognitive crises and sharp oscillations:

$$
\mathcal{H}_n^{\max} = \max_{t \in \{(n-1)L+1, \ldots, nL\}} H_t, \quad \Delta\mathcal{H}_n = \frac{1}{L-1} \sum_{t=(n-1)L+1}^{nL-1} \left|H_{t+1} - H_t\right|. \tag{12}
$$

A large value of $\Delta\mathcal{H}_n$ indicates sharp cognitive oscillation within the segment. Such instability may reveal local reasoning defects even when the average entropy remains moderate. Finally, the comprehensive uncertainty metric of the $n$-th segment is defined as

$$
U_n = \underbrace{\widetilde{\mathcal{H}}_n}_{\text{Base Uncertainty}} + \underbrace{\lambda_G \Delta\mathcal{H}_n}_{\text{Gradient Anomaly}} + \underbrace{\lambda_M \sigma_{\mathrm{sig}}\left(\frac{\mathcal{H}_n^{\max} - \mu_E}{\sigma_E + \varepsilon_E}\right)}_{\text{Extremum Deviation}}, \tag{13}
$$

where $\lambda_G$ and $\lambda_M$ are weighting coefficients, $\sigma_{\mathrm{sig}}(\cdot)$ denotes the sigmoid function, $\mu_E$ and $\sigma_E$ are the mean and standard deviation of cognitive entropy in the current batch, and $\varepsilon_E > 0$ is a small constant for numerical stability. The base term $\widetilde{\mathcal{H}}_n$ measures the overall uncertainty of the segment, the gradient term $\Delta\mathcal{H}_n$ detects cognitive instability, and the extremum term $\mathcal{H}_n^{\max}$ identifies local burst-like uncertainty spikes.
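The three terms of Eq. (13) can be combined as in the following sketch. All hyperparameter values are placeholders, and, as a simplifying assumption, $\mu_E$ and $\sigma_E$ are computed over the tokens of a single sequence rather than over a training batch.

```python
import math

def smooth(seg_H, W, alpha):
    """Eq. (11): distance-decayed window average, renormalised at the boundaries."""
    N, out = len(seg_H), []
    for n in range(N):
        num = den = 0.0
        for w in range(-W, W + 1):
            if 0 <= n + w < N:           # indicator 1[1 <= n+w <= N], 0-based here
                weight = alpha ** abs(w)
                num += weight * seg_H[n + w]
                den += weight
        out.append(num / den)
    return out

def variation(H_seg):
    """Eq. (12), right: mean absolute change between consecutive token entropies."""
    return sum(abs(b - a) for a, b in zip(H_seg, H_seg[1:])) / (len(H_seg) - 1)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def uncertainty(token_H, L, W=1, alpha=0.5, lam_G=0.5, lam_M=0.5, eps_E=1e-8):
    """Eq. (13): U_n for every segment of one sequence (requires L >= 2)."""
    segs = [token_H[i:i + L] for i in range(0, len(token_H), L)]
    smoothed = smooth([sum(s) / L for s in segs], W, alpha)
    # Assumption: mu_E and sigma_E are computed over this sequence's tokens only,
    # standing in for the batch statistics described in the text.
    mu = sum(token_H) / len(token_H)
    sd = math.sqrt(sum((h - mu) ** 2 for h in token_H) / len(token_H))
    return [sm + lam_G * variation(s) + lam_M * sigmoid((max(s) - mu) / (sd + eps_E))
            for sm, s in zip(smoothed, segs)]

# Hypothetical entropies: a calm segment, an oscillating one, and another calm one.
token_H = [0.2, 0.2, 0.2, 0.2,  0.2, 2.5, 0.1, 2.0,  0.2, 0.2, 0.2, 0.2]
U = uncertainty(token_H, L=4)
```

In this toy input the middle segment receives the largest $U_n$: its mean entropy, its oscillation, and its peak are all elevated, while the two calm neighbours are raised only through smoothing.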

### 3.3 Segment-Level Advantage Assignment

Under the group-sampling framework of GRPO, each question $q$ is associated with $G$ sampled output sequences and their segment-level uncertainty metrics:

$$
\{y^1, y^2, \ldots, y^G\} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid q), \quad \mathcal{D}_n^{\mathrm{group}} = \left\{U_n^1, U_n^2, \ldots, U_n^G\right\}. \tag{14}
$$

We compute the group mean and standard deviation as

$$
\mu_n^{\mathrm{group}} = \frac{1}{G} \sum_{i=1}^{G} U_n^i, \quad \sigma_n^{\mathrm{group}} = \sqrt{\frac{1}{G} \sum_{i=1}^{G} \left(U_n^i - \mu_n^{\mathrm{group}}\right)^2}. \tag{15}
$$
The macro\-level dynamic baseline and its adaptive offset are then defined as

$$
\beta_n^{\mathrm{macro}} = \mu_n^{\mathrm{group}} + \kappa\!\left(\sigma_n^{\mathrm{group}}\right) \sigma_n^{\mathrm{group}}, \quad \kappa(\sigma) = \kappa_0 + \kappa_1 \tanh\left(\frac{\sigma - \sigma_0}{\sigma_0 + \varepsilon_{\sigma}}\right). \tag{16}
$$

This offset function adaptively relaxes or tightens the rejection threshold according to the variance of group-level entropy, which reflects the ambiguity of the current problem. At the micro level, for the $e_n$-th erasure attempt of the $n$-th segment, we introduce an exponentially increasing penalty factor and the final adaptive threshold:

$$
\Gamma(e_n) = \exp\left(\eta e_n^{\delta}\right), \quad \Theta_n(e_n) = \beta_n^{\mathrm{macro}} \cdot \Gamma(e_n) \cdot \phi(\mathcal{H}_{<n}), \tag{17}
$$

where $\eta > 0$ controls the penalty strength and $\delta > 0$ controls the growth order. The history-dependent modulation function is

$$
\phi(\mathcal{H}_{<n}) = 1 + \rho \tanh\left(\frac{1}{n-1} \sum_{m=1}^{n-1} \frac{\widetilde{\mathcal{H}}_m - \beta_m^{\mathrm{macro}}}{\beta_m^{\mathrm{macro}} + \varepsilon_{\beta}}\right), \qquad n > 1. \tag{18}
$$

$\text{E}^3\text{RL}$ extends GRPO from sequence-level optimization to segment-level optimization. Given the final sequence-level reward $R(y^i)$, we use causal backtracking assignment to obtain the reward signal for the $n$-th segment. The token attribution weight, the segment-level reward, and its normalization factor are defined as

$$
a_t^i = \frac{1}{L'} \sum_{t'=T-L'+1}^{T} \mathrm{Attn}_{t' \rightarrow t}^{i}, \quad R_n^i = \frac{R(y^i)}{Z^i} \sum_{t=(n-1)L+1}^{nL} a_t^i, \quad Z^i = \sum_{\tau=1}^{T} a_{\tau}^i, \tag{19}
$$

where $L'$ is the length of the terminal attribution window and $\mathrm{Attn}_{t' \rightarrow t}^{i}$ denotes the attention mass from token $t'$ to token $t$ in the $i$-th trajectory. The segment-level advantage of the $n$-th segment in the $i$-th sequence is defined by group normalization:

$$
A_n^i = \frac{R_n^i - \frac{1}{G} \sum_{j=1}^{G} R_n^j}{\sqrt{\frac{1}{G} \sum_{j=1}^{G} \left(R_n^j - \frac{1}{G} \sum_{k=1}^{G} R_n^k\right)^2} + \varepsilon_A}. \tag{20}
$$
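The reward split of Eq. (19) and the group normalization of Eq. (20) can be sketched as follows. The attribution weights $a_t^i$ are passed in directly as hypothetical numbers; how attention mass is extracted and aggregated across heads and layers is not specified here and is left as an assumption for the caller.

```python
import math

def segment_rewards(final_reward, attn_to_token, L):
    """Eq. (19): split R(y) across segments in proportion to attribution mass.

    attn_to_token[t] plays the role of a_t, i.e. the attention that token t
    receives, already averaged over the last L' positions.
    """
    Z = sum(attn_to_token)
    return [final_reward * sum(attn_to_token[i:i + L]) / Z
            for i in range(0, len(attn_to_token), L)]

def group_advantages(rewards, eps_A=1e-8):
    """Eq. (20): normalise one segment's rewards across the G sampled sequences."""
    G = len(rewards)
    mean = sum(rewards) / G
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / G)
    return [(r - mean) / (std + eps_A) for r in rewards]

# Hypothetical group of G = 3 trajectories, each with N = 2 segments of length L = 2.
attn = [[0.1, 0.1, 0.4, 0.4], [0.25, 0.25, 0.25, 0.25], [0.4, 0.4, 0.1, 0.1]]
final = [1.0, 1.0, 0.0]                               # sequence-level rewards R(y^i)
R = [segment_rewards(r, a, L=2) for r, a in zip(final, attn)]
A_first = group_advantages([R_i[0] for R_i in R])     # advantages for segment n = 1
```

By construction the segment rewards of a trajectory sum to its sequence-level reward, so a trajectory with zero reward contributes zero to every segment.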
The segmented policy optimization objective of $\text{E}^3\text{RL}$ is formulated as

$$
\begin{aligned}
J_{\text{E}^3\text{RL}}(\theta) = \mathbb{E}_{q \sim P(\mathcal{Q}),\, \{y^i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}} \Bigg[ \frac{1}{G} \sum_{i=1}^{G} \sum_{n=1}^{N} \min\Big( & \rho_n^i(\theta) A_n^i, \\
& \mathrm{clip}\left(\rho_n^i(\theta), 1-\epsilon, 1+\epsilon\right) A_n^i \Big) - \beta D_{\mathrm{KL}}\left(\pi_{\theta} \,\|\, \pi_{\mathrm{ref}}\right) \Bigg],
\end{aligned} \tag{21}
$$

where the segment-level probability ratio is

$$
\rho_n^i(\theta) = \frac{\pi_{\theta}(s_n^i \mid q, s_{<n}^i)}{\pi_{\theta_{\mathrm{old}}}(s_n^i \mid q, s_{<n}^i)}. \tag{22}
$$

This segment-level credit assignment ensures that the gradient signal is applied only to the effective segments that are retained and eventually contribute to the final reasoning trajectory, thereby improving the stability of reinforcement learning.
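The clipped surrogate in Eqs. (21) and (22) can be evaluated numerically as in the sketch below. It is a scalar illustration, not a training loop: the log-probabilities, the clip range of 0.2, and the choice to apply the KL penalty once per question are all assumptions made for the example.

```python
import math

def segment_ratio(new_token_lp, old_token_lp):
    """Eq. (22): ratio of segment probabilities, computed from token log-probs."""
    return math.exp(sum(new_token_lp) - sum(old_token_lp))

def clipped_term(ratio, advantage, clip_eps=0.2):
    """Inner term of Eq. (21): min(rho * A, clip(rho, 1-eps, 1+eps) * A)."""
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)

def surrogate(ratios, advantages, kl=0.0, beta=0.0, clip_eps=0.2):
    """Sample estimate of Eq. (21) for one question: ratios[i][n], advantages[i][n]."""
    G = len(ratios)
    total = sum(clipped_term(r, a, clip_eps)
                for r_i, a_i in zip(ratios, advantages)
                for r, a in zip(r_i, a_i))
    # Assumption: the KL penalty is applied once per question.
    return total / G - beta * kl

rho = segment_ratio([-0.4, -0.6], [-0.5, -0.7])   # exp(0.2), about 1.221
capped = clipped_term(rho, advantage=1.0)          # gain capped at 1 + eps
penalised = clipped_term(rho, advantage=-1.0)      # no cap when A < 0 and rho is large
J = surrogate([[rho, 1.0]], [[1.0, -0.5]])
```

The asymmetry between `capped` and `penalised` is the usual behaviour of a clipped surrogate: gains from increasing the probability of a good segment are bounded, while the penalty for increasing the probability of a bad one is not.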

### 3.4 Erasable Reinforcement Learning

Given a candidate segment $s_n^i$ from the $i$-th sequence, the non-Markovian erasure operator $\mathcal{E}$ maps the generation state to a binary decision:

$$
\mathcal{E}: \left(s_n^i, U_n^i, \Theta_{n, e_n^i}\right) \mapsto \{\top, \bot\}, \tag{23}
$$

where $\top$ denotes erasure and $\bot$ denotes acceptance. Based on the comprehensive uncertainty metric $U_n^i$ and the dynamic threshold $\Theta_{n, e_n^i}$, that is, the threshold $\Theta_n(e_n)$ of Eq. (17) evaluated at $e_n = e_n^i$, the erasure trigger is defined as

$$
\mathcal{E}(s_n^i) = \begin{cases} \bot, & U_n^i \leq \Theta_{n, e_n^i}, \quad \text{accept}, \\ \top, & U_n^i > \Theta_{n, e_n^i}, \quad \text{erase}. \end{cases} \tag{24}
$$
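The threshold of Eqs. (15) to (18) and the trigger of Eq. (24) can be sketched together. Every constant below ($\kappa_0$, $\kappa_1$, $\sigma_0$, $\eta$, $\delta$, $\rho$) is a placeholder value, and returning $\phi = 1$ for the first segment is an assumption, since Eq. (18) is only defined for $n > 1$.

```python
import math

def macro_baseline(group_U, kappa0=1.0, kappa1=0.5, sigma0=0.5, eps_sigma=1e-8):
    """Eqs. (15)-(16): beta = mu + kappa(sigma) * sigma over one segment's group."""
    G = len(group_U)
    mu = sum(group_U) / G
    sigma = math.sqrt(sum((u - mu) ** 2 for u in group_U) / G)
    kappa = kappa0 + kappa1 * math.tanh((sigma - sigma0) / (sigma0 + eps_sigma))
    return mu + kappa * sigma

def penalty(e, eta=0.3, delta=1.0):
    """Eq. (17), left: Gamma(e) = exp(eta * e**delta); equals 1 on the first attempt."""
    return math.exp(eta * e ** delta)

def history_modulation(past_smoothed_H, past_betas, rho=0.2, eps_beta=1e-8):
    """Eq. (18); returns 1.0 for the first segment, where no history exists (assumption)."""
    if not past_betas:
        return 1.0
    rel = [(h - b) / (b + eps_beta) for h, b in zip(past_smoothed_H, past_betas)]
    return 1.0 + rho * math.tanh(sum(rel) / len(rel))

def should_erase(U, beta, e, phi=1.0):
    """Eq. (24): erase (True) when U exceeds Theta = beta * Gamma(e) * phi."""
    return U > beta * penalty(e) * phi

group_U = [0.8, 1.0, 1.2, 2.4]       # hypothetical U_n^i for G = 4 samples
beta = macro_baseline(group_U)
decisions = [should_erase(u, beta, e=0) for u in group_U]
retry = should_erase(2.4, beta, e=1)  # same segment score after one erasure
```

With these placeholder values only the outlier sample is erased on the first attempt, and the same score would be accepted after one erasure because $\Gamma(e_n)$ has raised the threshold.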
Let the generation state before producing the $n$-th segment be $S_n^i = (s_{<n}^i, e_n^i)$, where $e_n^i$ records the number of erasure attempts for the current segment. The state transition is defined as

$$
S_{\mathrm{next}}^i = \begin{cases} \left(s_{<n}^i \oplus s_n^i, 0\right), & \mathcal{E}(s_n^i) = \bot, \\ \left(s_{<n}^i, e_n^i + 1\right), & \mathcal{E}(s_n^i) = \top \ \land\ e_n^i < E_{\max}, \\ \left(s_{<n}^i \oplus s_n^i, 0\right), & \mathcal{E}(s_n^i) = \top \ \land\ e_n^i = E_{\max}. \end{cases} \tag{25}
$$

In the erasure branch, the system removes the current pathological segment, resamples it from the same prefix, and updates the erasure threshold:

$$
s_n^i \sim \pi_{\theta}\left(\cdot \mid q, s_{<n}^i\right), \qquad e_n^i \leftarrow e_n^i + 1, \qquad \Theta_{n, e_n^i} = \beta_n^{\mathrm{macro}} \cdot \Gamma(e_n^i) \cdot \phi\left(\mathcal{H}_{<n}\right). \tag{26}
$$

Therefore, $\text{E}^3\text{RL}$ introduces an explicit rollback-and-retry mechanism into autoregressive reinforcement learning. Instead of forcing every sampled segment to be permanently committed, the model dynamically detects, erases, and regenerates high-uncertainty segments before they propagate into later reasoning steps.
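The state transition of Eq. (25) corresponds to a simple retry loop per segment. The sketch below is a toy illustration under stated assumptions: `sample_segment`, `score`, and `threshold` are hypothetical callables standing in for the policy, the uncertainty metric of Eq. (13), and the threshold of Eq. (17), and the "segments" are just lists of random numbers.

```python
import random

def generate_with_erasure(sample_segment, score, threshold, N, E_max):
    """Segment-wise rollout following Eqs. (24)-(26).

    sample_segment(prefix) -> new segment; score(segment) -> U;
    threshold(n, e) -> Theta. All three are hypothetical stand-ins.
    """
    prefix, erasures = [], []
    for n in range(N):
        e = 0
        while True:
            seg = sample_segment(prefix)
            # Accept if uncertainty is within the threshold, or the retry budget is spent.
            if score(seg) <= threshold(n, e) or e == E_max:
                prefix.append(seg)       # commit: prefix <- prefix (+) segment
                erasures.append(e)
                break
            e += 1                       # erase: keep the prefix, retry the segment
    return prefix, erasures

rng = random.Random(0)
sample = lambda prefix: [rng.random() for _ in range(4)]   # toy "segment"
score = lambda seg: max(seg)                                # toy uncertainty
threshold = lambda n, e: 0.7 * (1.3 ** e)                   # loosens with each retry
trajectory, erasures = generate_with_erasure(sample, score, threshold, N=5, E_max=3)
```

Because an erased segment is resampled from an unchanged prefix, a real implementation could keep the cached key-value states of that prefix, which is the KV cache reuse described in the text. The cap $E_{\max}$ guarantees termination even if the threshold is never met.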

## 4 Experiments

### 4.1 Experimental Setup

Datasets and Benchmarks. We selected 51k samples from the DeepMath-103k (he2025deepmath111) dataset for reinforcement learning training. To rigorously assess the reasoning capabilities of the models, we evaluate the trained policies across a spectrum of highly competitive and complex mathematical benchmarks: AMC 2023 (amc2023_12a_aops; amc2023_12b_aops), AIME 2024 (aime2024_i_aops; aime2024_ii_aops), AIME 2025 (aime2025_i_aops; aime2025_ii_aops), AIME 2026 (aime2026_i_aops; aime2026_ii_aops), MATH 500 (lightman2023let), Minerva (lewkowycz2022solving), and OlympiadBench (he2024olympiadbench).

Models and Baselines. We conduct experiments at two parameter scales based on the Qwen3 architecture: Qwen3-4B and Qwen3-8B (yang2025qwen3). To contextualize the performance of $\text{E}^3\text{RL}$, we compare it against several strong baselines: Vanilla, GRPO (shao2024deepseekmath111), DAPO (yu2026dapo111), GSPO (zheng2025group111), and SAPO (gao2025soft111).

Evaluation Metrics. Following standard protocols for mathematical reasoning, we report the Avg@k and Pass@k metrics, setting $k = 32$.
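The text does not spell out the two metrics, so the sketch below uses one common reading as an assumption: Avg@k is the mean accuracy over the $k$ samples of each problem, averaged over problems, and Pass@k is the fraction of problems for which at least one of the $k$ samples is correct. The correctness matrix is made up for illustration.

```python
def avg_at_k(correct):
    """Mean accuracy over the k samples of each problem, averaged over problems."""
    return sum(sum(row) / len(row) for row in correct) / len(correct)

def pass_at_k(correct):
    """Fraction of problems solved by at least one of the k samples."""
    return sum(1 for row in correct if any(row)) / len(correct)

# Hypothetical results: 3 problems, k = 4 samples each (True = correct answer).
correct = [
    [True, True, False, True],
    [False, False, False, True],
    [False, False, False, False],
]
avg = avg_at_k(correct)     # (0.75 + 0.25 + 0.0) / 3
pas = pass_at_k(correct)    # 2 of 3 problems solved at least once
```

Under this reading Pass@k can never fall below Avg@k, which matches the pattern of the reported results.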

![Refer to caption](https://arxiv.org/html/2606.17735v1/figures/comprehensive_Modified.png)

Figure 3: Training dynamics of different RL strategies.

![Refer to caption](https://arxiv.org/html/2606.17735v1/figures/merge_4b_passk_and_8b_pass_k.png)

Figure 4: Pass@k scaling of different RL strategies.

| Model | Method | AMC 2023 Avg | AMC 2023 Pass | AIME 2024 Avg | AIME 2024 Pass | AIME 2025 Avg | AIME 2025 Pass | AIME 2026 Avg | AIME 2026 Pass | MATH 500 Avg | MATH 500 Pass | Minerva Avg | Minerva Pass | OlympiadBench Avg | OlympiadBench Pass |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-4B | Vanilla | 0.712 | 0.900 | 0.358 | 0.633 | 0.229 | 0.533 | 0.254 | 0.533 | 0.685 | 0.754 | 0.204 | 0.298 | 0.490 | 0.633 |
| Qwen3-4B | GRPO | 0.829 | 0.975 | 0.442 | 0.833 | 0.375 | 0.700 | 0.433 | 0.667 | 0.727 | 0.776 | 0.235 | 0.335 | 0.601 | 0.787 |
| Qwen3-4B | DAPO | 0.842 | 0.975 | 0.436 | 0.800 | 0.359 | 0.833 | 0.419 | 0.733 | 0.725 | 0.778 | 0.238 | 0.335 | 0.606 | 0.799 |
| Qwen3-4B | GSPO | 0.849 | 0.975 | 0.445 | 0.833 | 0.382 | 0.700 | 0.417 | 0.667 | 0.731 | 0.778 | 0.241 | 0.327 | 0.618 | 0.807 |
| Qwen3-4B | SAPO | 0.855 | 0.975 | 0.475 | 0.833 | 0.374 | 0.667 | 0.445 | 0.667 | 0.733 | 0.782 | 0.242 | 0.346 | 0.615 | 0.797 |
| Qwen3-4B | $\text{E}^3\text{RL}$ | 0.878 | 0.975 | 0.506 | 0.833 | 0.433 | 0.767 | 0.485 | 0.733 | 0.742 | 0.786 | 0.256 | 0.331 | 0.639 | 0.813 |
| Qwen3-8B | Vanilla | 0.670 | 0.875 | 0.349 | 0.600 | 0.215 | 0.467 | 0.253 | 0.533 | 0.668 | 0.750 | 0.195 | 0.286 | 0.472 | 0.619 |
| Qwen3-8B | GRPO | 0.831 | 0.975 | 0.519 | 0.767 | 0.399 | 0.633 | 0.458 | 0.733 | 0.737 | 0.776 | 0.240 | 0.346 | 0.629 | 0.801 |
| Qwen3-8B | DAPO | 0.846 | 0.950 | 0.538 | 0.800 | 0.414 | 0.667 | 0.508 | 0.767 | 0.738 | 0.780 | 0.239 | 0.331 | 0.637 | 0.805 |
| Qwen3-8B | GSPO | 0.842 | 0.950 | 0.521 | 0.833 | 0.409 | 0.633 | 0.482 | 0.733 | 0.745 | 0.782 | 0.241 | 0.346 | 0.636 | 0.798 |
| Qwen3-8B | SAPO | 0.843 | 0.975 | 0.507 | 0.767 | 0.394 | 0.600 | 0.466 | 0.767 | 0.743 | 0.786 | 0.246 | 0.351 | 0.639 | 0.805 |
| Qwen3-8B | $\text{E}^3\text{RL}$ | 0.887 | 0.975 | 0.575 | 0.767 | 0.448 | 0.800 | 0.513 | 0.733 | 0.758 | 0.786 | 0.253 | 0.357 | 0.654 | 0.816 |

Table 1: Avg@32 and Pass@32 results of different RL strategies.

### 4.2 Main Results

Table [1](https://arxiv.org/html/2606.17735#S4.T1) summarizes the main results of different reinforcement learning strategies on seven mathematical reasoning benchmarks. Overall, $\text{E}^3\text{RL}$ achieves the strongest Avg@32 performance on every benchmark at both model scales, indicating that dynamic epistemic entropy guided erasure consistently improves the average quality of long-horizon reasoning trajectories.

Training Dynamics. As shown in Figure [3](https://arxiv.org/html/2606.17735#S4.F3), $\text{E}^3\text{RL}$ achieves the highest training accuracy while maintaining a competitive finish ratio. It produces longer responses with fewer reasoning segments, suggesting that the model learns to preserve useful prefixes and erase only unstable local fragments. This behavior is consistent with entropy-orchestrated erasure improving exploration quality and stabilizing long-range credit assignment.

Pass@k Scaling. Figure [4](https://arxiv.org/html/2606.17735#S4.F4) shows that $\text{E}^3\text{RL}$ delivers consistently strong Pass@k curves across both Qwen3-4B and Qwen3-8B. The advantage is especially visible on AIME, Minerva, and OlympiadBench, where increasing the number of samples leads to more reliable success. These results indicate that $\text{E}^3\text{RL}$ improves not only average trajectory quality but also multi-sample recoverability.

### 4.3 Ablation Experiment

To understand the inner workings of $\text{E}^3\text{RL}$ and validate our design, we conduct ablation studies on the Qwen3-4B model. We decompose the system along two dimensions: the components of the cognitive uncertainty metric and the structural system mechanisms. We evaluate these variants using the Avg@32 metric to observe their impact on the expected stability of the reasoning trajectories.

| Method (Qwen3-4B) | AMC 2023 | AIME 2024 | AIME 2025 | AIME 2026 | MATH 500 | Minerva | OlympiadBench |
|---|---|---|---|---|---|---|---|
| $\text{E}^3\text{RL}$ | 0.878 | 0.506 | 0.433 | 0.485 | 0.742 | 0.256 | 0.639 |
| w/o extremum deviation | 0.870 | 0.512 | 0.425 | 0.477 | 0.738 | 0.252 | 0.634 |
| w/o gradient anomaly | 0.867 | 0.498 | 0.427 | 0.474 | 0.735 | 0.251 | 0.631 |
| w/o base uncertainty | 0.841 | 0.445 | 0.430 | 0.426 | 0.723 | 0.236 | 0.605 |
| ow base uncertainty | 0.868 | 0.502 | 0.431 | 0.481 | 0.737 | 0.254 | 0.636 |
| ow gradient anomaly | 0.839 | 0.443 | 0.432 | 0.432 | 0.726 | 0.242 | 0.607 |
| ow extremum deviation | 0.834 | 0.450 | 0.413 | 0.421 | 0.724 | 0.238 | 0.599 |

Table 2: Ablation study on cognitive entropy components evaluated on the Avg@32 metric. "w/o" denotes "without" and "ow" denotes "only with".

| Method (Qwen3-4B) | AMC 2023 | AIME 2024 | AIME 2025 | AIME 2026 | MATH 500 | Minerva | OlympiadBench |
|---|---|---|---|---|---|---|---|
| $\text{E}^3\text{RL}$ | 0.878 | 0.506 | 0.433 | 0.485 | 0.742 | 0.256 | 0.639 |
| w/o frequency penalty | 0.872 | 0.503 | 0.429 | 0.481 | 0.739 | 0.255 | 0.637 |
| w/o causal allocation | 0.865 | 0.494 | 0.416 | 0.476 | 0.737 | 0.252 | 0.634 |
| w/o group dynamics | 0.854 | 0.486 | 0.398 | 0.468 | 0.733 | 0.249 | 0.625 |
| ow group dynamics | 0.863 | 0.492 | 0.412 | 0.473 | 0.740 | 0.253 | 0.632 |
| ow causal allocation | 0.855 | 0.483 | 0.394 | 0.459 | 0.734 | 0.248 | 0.628 |
| ow frequency penalty | 0.846 | 0.467 | 0.386 | 0.431 | 0.729 | 0.239 | 0.617 |

Table 3: Ablation study on system mechanisms evaluated on the Avg@32 metric. "w/o" denotes "without" and "ow" denotes "only with".

Deconstructing Epistemic Entropy Monitoring. Table [2](https://arxiv.org/html/2606.17735#S4.T2) isolates the effects of the three uncertainty signals: base uncertainty ($\widetilde{\mathcal{H}}_n$), gradient anomaly ($\Delta\mathcal{H}_n$), and extremum deviation ($\mathcal{H}_n^{\max}$). The results indicate that base uncertainty is the foundation of the erasure mechanism. Removing it (w/o base uncertainty) lowers every benchmark and causes the largest drop of the three removal variants on six of the seven, most notably from 0.485 to 0.426 on AIME 2026 and from 0.506 to 0.445 on AIME 2024; AIME 2025 is the exception, moving only from 0.433 to 0.430. Conversely, relying exclusively on base uncertainty (ow base uncertainty) retains most of the model's performance but stays below the full system on every benchmark. The gradient anomaly and extremum deviation components act as high-frequency filters. Relying on either alone (ow gradient anomaly or ow extremum deviation) yields some of the lowest scores in the table, suggesting that they are insufficient as standalone indicators of logical collapse. When integrated into the full $\text{E}^3\text{RL}$ system, however, they help detect local burst-like cognitive crises and sharp intra-segment oscillations, and the full system obtains the best Avg@32 on six of the seven benchmarks (the w/o extremum deviation variant is marginally higher on AIME 2024, 0.512 versus 0.506).
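To make the three signals concrete, the following is a minimal sketch of how segment-level quantities of this kind could be computed from per-token entropies. It is not the paper's implementation: the exact definitions live in Section 3, and the function names, the trailing moving average, the choice of "largest jump" for the gradient term, "max minus base" for the extremum term, and the weights are all assumptions made for illustration.

```python
from statistics import mean


def moving_average(values, window):
    """Trailing moving average; early positions average whatever is available."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - window + 1)
        out.append(mean(values[lo:i + 1]))
    return out


def segment_signals(token_entropies, window=4):
    """Return (base, gradient, extremum) for one segment (hypothetical definitions)."""
    smoothed = moving_average(token_entropies, window)
    base = mean(smoothed)                                  # stand-in for base uncertainty
    jumps = [abs(b - a) for a, b in zip(smoothed, smoothed[1:])]
    gradient = max(jumps) if jumps else 0.0                # stand-in for gradient anomaly
    extremum = max(token_entropies) - base                 # stand-in for extremum deviation
    return base, gradient, extremum


def epistemic_score(token_entropies, weights=(1.0, 0.5, 0.5), window=4):
    """Weighted sum of the three signals; the weights are illustrative only."""
    base, gradient, extremum = segment_signals(token_entropies, window)
    w_base, w_grad, w_ext = weights
    return w_base * base + w_grad * gradient + w_ext * extremum


calm = [0.20, 0.25, 0.20, 0.22, 0.21, 0.20]
spiky = [0.20, 0.25, 2.40, 0.22, 0.21, 0.20]   # one burst of uncertainty
print(epistemic_score(calm), epistemic_score(spiky))
```

In this toy setup a single entropy burst raises all three terms at once, which matches the intuition that the gradient and extremum terms react to short spikes that a smoothed average alone would dilute.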

Dissecting Structural System Mechanisms. Table [3](https://arxiv.org/html/2606.17735#S4.T3) isolates the impact of our non-Markovian system designs: the micro-level frequency penalty ($\Gamma(e_n)$), segment-level causal allocation, and the macro-level group dynamics ($\beta_n^{\mathrm{macro}}$). The most severe degradation occurs when the system operates only with the frequency penalty (ow frequency penalty), which lowers AIME 2024 to 0.467 and AIME 2025 to 0.386 and gives the lowest score in the table on every benchmark. This is consistent with our theoretical claim that, without adaptive group dynamics to scale thresholds based on problem ambiguity, static thresholding fails to generalize across varying difficulty levels. Removing the frequency penalty (w/o frequency penalty) degrades performance only slightly. This is consistent with its designed role: the exponentially increasing micro-penalty prevents the model from falling into endless resampling loops on highly difficult segments, precluding reasoning deadlocks. Finally, ablating causal allocation (w/o causal allocation) causes a broad performance decay, which supports the view that routing the gradient so that only successfully retained logic segments are rewarded preserves the multi-scale credit assignment system.
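The interaction between the two threshold mechanisms can be illustrated with a small sketch. This is an assumption-laden toy, not the paper's formulation: `macro_threshold`, `frequency_penalty`, `should_erase`, the mean-plus-standard-deviation rule, and the constants `kappa` and `lam` are hypothetical names and choices, used only to show a group-calibrated threshold that grows exponentially with the number of erasures already spent on a segment.

```python
import math
from statistics import mean, pstdev


def macro_threshold(group_scores, kappa=1.0):
    """Group-calibrated threshold: mean plus kappa standard deviations (assumed form)."""
    return mean(group_scores) + kappa * pstdev(group_scores)


def frequency_penalty(erase_count, lam=0.5):
    """Multiplier that grows exponentially with prior erasures (assumed form)."""
    return math.exp(lam * erase_count)


def should_erase(score, group_scores, erase_count, kappa=1.0, lam=0.5):
    """Erase only if the segment's uncertainty exceeds the penalized threshold.

    Assumes non-negative scores, as is the case for entropy-based measures.
    """
    threshold = macro_threshold(group_scores, kappa) * frequency_penalty(erase_count, lam)
    return score > threshold


group = [1.0, 1.2, 0.8, 1.0]          # uncertainty scores of sibling trajectories
print(should_erase(1.5, group, erase_count=0))
print(should_erase(1.5, group, erase_count=1))
```

Because the multiplier compounds with each retry, a segment that keeps scoring high eventually passes the check and is committed, which is the deadlock-avoidance behavior described above.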

![Refer to caption](https://arxiv.org/html/2606.17735v1/x3.png)

Figure 5: Training dynamics of $\text{E}^3\text{RL}$ under different Erase@k settings.

| Method (Qwen3-8B) | AMC 2023 | AIME 2024 | AIME 2025 | AIME 2026 | MATH 500 | Minerva | OlympiadBench |
|---|---|---|---|---|---|---|---|
| erase@1 | 0.855 | 0.543 | 0.419 | 0.484 | 0.747 | 0.239 | 0.642 |
| erase@3 | 0.874 | 0.559 | 0.426 | 0.501 | 0.752 | 0.248 | 0.647 |
| erase@5 | 0.887 | 0.575 | 0.448 | 0.513 | 0.758 | 0.253 | 0.654 |

Table 4: Erase@k results of $\text{E}^3\text{RL}$, evaluated using Avg@32.

| Segments $\times$ length (Qwen3-8B) | AMC 2023 | AIME 2024 | AIME 2025 | AIME 2026 | MATH 500 | Minerva | OlympiadBench |
|---|---|---|---|---|---|---|---|
| 32 $\times$ 256 | 0.851 | 0.544 | 0.433 | 0.489 | 0.748 | 0.246 | 0.643 |
| 16 $\times$ 512 | 0.896 | 0.583 | 0.452 | 0.525 | 0.761 | 0.254 | 0.662 |
| 8 $\times$ 1024 | 0.887 | 0.575 | 0.448 | 0.513 | 0.758 | 0.253 | 0.654 |
| 4 $\times$ 2048 | 0.875 | 0.568 | 0.441 | 0.506 | 0.750 | 0.249 | 0.649 |
| 2 $\times$ 4096 | 0.842 | 0.543 | 0.423 | 0.484 | 0.744 | 0.245 | 0.638 |

Table 5: Study on the segments $\times$ length configuration of $\text{E}^3\text{RL}$, evaluated using Avg@32.

## 5 Analysis

To further investigate the behavior of $\text{E}^3\text{RL}$, we analyze two hyperparameters that directly control the granularity of self-correction: the maximum number of erasure attempts and the segment-length configuration. The analysis focuses on Qwen3-8B and follows the same Avg@32 evaluation protocol used in the main experiments. Together, Figure [5](https://arxiv.org/html/2606.17735#S4.F5), Table [4](https://arxiv.org/html/2606.17735#S4.T4), Table [5](https://arxiv.org/html/2606.17735#S4.T5), and Figure [6](https://arxiv.org/html/2606.17735#S5.F6) show that $\text{E}^3\text{RL}$ benefits from a sufficiently large erasure budget and a balanced segmentation scheme, while overly restrictive rollback or overly coarse segmentation weakens local correction.

Erase@k Scaling. As shown in Figure [5](https://arxiv.org/html/2606.17735#S4.F5), larger Erase@k budgets produce more stable training dynamics. Erase@5 achieves the highest accuracy curve and maintains smoother improvement than Erase@1, indicating that allowing multiple localized retries gives the model a broader search space before committing an uncertain segment. Table [4](https://arxiv.org/html/2606.17735#S4.T4) provides quantitative confirmation: Erase@5 obtains the best Avg@32 on all seven benchmarks, improving over Erase@1 from 0.855 to 0.887 on AMC 2023, from 0.543 to 0.575 on AIME 2024, and from 0.642 to 0.654 on OlympiadBench. The monotonic trend from Erase@1 to Erase@3 and Erase@5 suggests that repeated erasure mainly helps difficult reasoning problems where early local defects are more likely to induce downstream collapse.

Segment-Granularity Trade-off. Table [5](https://arxiv.org/html/2606.17735#S4.T5) shows that segmentation itself has a non-trivial optimum. The 16 $\times$ 512 configuration achieves the best Avg@32 across all benchmarks, reaching 0.896 on AMC 2023, 0.583 on AIME 2024, 0.525 on AIME 2026, and 0.662 on OlympiadBench. In contrast, 32 $\times$ 256 underperforms because overly short segments may fragment coherent reasoning and trigger excessive local decisions, while 2 $\times$ 4096 performs worst because long segments delay error detection and resemble ordinary irreversible generation. The default 8 $\times$ 1024 setting remains competitive, but the 16 $\times$ 512 result indicates that finer yet still semantically meaningful checkpoints can better balance correction precision and reasoning continuity.
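One detail worth making explicit is that every configuration in Table 5 multiplies out to the same generation budget, so the comparison varies only how finely that budget is checkpointed. The short check below uses the AIME 2024 column of Table 5; the dictionary names are ours.

```python
# (segments, segment_length) pairs from Table 5
configs = {
    "32x256": (32, 256),
    "16x512": (16, 512),
    "8x1024": (8, 1024),
    "4x2048": (4, 2048),
    "2x4096": (2, 4096),
}

# Avg@32 on AIME 2024 for Qwen3-8B, copied from Table 5
aime2024 = {
    "32x256": 0.544,
    "16x512": 0.583,
    "8x1024": 0.575,
    "4x2048": 0.568,
    "2x4096": 0.543,
}

budgets = {name: n * length for name, (n, length) in configs.items()}
best = max(aime2024, key=aime2024.get)

print(budgets)   # every entry is the same total number of generated tokens
print(best)
```

Since the total is fixed at 8,192 tokens in each row, the non-monotonic accuracy pattern can be read as an effect of checkpoint granularity rather than of extra generation length.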

Overall Analysis. These results indicate that $\text{E}^3\text{RL}$ is not merely benefiting from additional sampling, but from where and how additional computation is spent. Increasing Erase@k improves the ability to repair unstable segments, while the segment-level dynamics in Figure [6](https://arxiv.org/html/2606.17735#S5.F6) show that uncertainty is concentrated unevenly across the reasoning trajectory.

![Refer to caption](https://arxiv.org/html/2606.17735v1/x4.png)

Figure 6: Segment-level training dynamics of $\text{E}^3\text{RL}$ under different Erase@k settings.

## 6 Related Work

Recent reinforcement learning methods have substantially advanced the reasoning capabilities of large language models. GRPO (shao2024deepseekmath) removes the critic model used in PPO and estimates advantages from group-level relative rewards, reducing the training cost of RL for reasoning models. DAPO (yu2025dapo) further improves large-scale RL training through decoupled clipping, dynamic sampling, token-level policy-gradient loss, and overlong reward shaping, alleviating entropy collapse and improving training stability. GSPO (zheng2025group) replaces token-level importance ratios with sequence-level likelihood ratios, making policy optimization more stable and better aligned with sequence-level rewards. SAPO (gao2025soft) introduces soft adaptive policy optimization by replacing hard clipping with smooth temperature-controlled scaling, enabling more stable and informative policy updates. BAPO (xi2025bapo) studies off-policy RL for LLMs and uses balanced adaptive clipping to preserve policy entropy and stabilize optimization under stale-policy data. DisCO (li2025disco) reformulates reasoning RL as discriminative constrained optimization, mitigating the question-level difficulty bias of group-relative objectives and improving training stability.

## 7 Limitations and Future Discussion

Although $\text{E}^3\text{RL}$ achieves consistent gains on long-horizon mathematical reasoning benchmarks, several aspects remain worth further exploration. First, our experiments focus on mathematical and logical reasoning, where long-chain dependency and error propagation are especially prominent. This setting provides a natural testbed for studying the autoregressive curse, while future work may further examine the applicability of erasable reinforcement learning to other structured generation tasks, such as code reasoning, tool-augmented problem solving, and multi-turn agentic workflows. Second, the current implementation uses a predefined segment length, smoothing window size, and erasure-attempt budget. These choices are lightweight and work well in our experiments, but more adaptive controllers may further improve efficiency across different model scales and task distributions. For example, future systems could dynamically adjust the segment granularity according to the local uncertainty profile of each reasoning trajectory.

## 8 Conclusion

In this paper, we propose Dynamic Epistemic Entropy Orchestrated Erasable Reinforcement Learning ($\text{E}^3\text{RL}$), a segment-level reinforcement learning framework for mitigating error propagation in long-horizon autoregressive reasoning. Instead of treating generation as a fully irreversible one-shot trajectory, $\text{E}^3\text{RL}$ introduces an endogenous uncertainty-driven erasure mechanism that detects high-risk reasoning segments, rolls them back, and regenerates them before they contaminate subsequent decisions. By combining epistemic entropy monitoring, adaptive thresholding, frequency-aware erasure control, and segment-level advantage assignment, the proposed method provides a lightweight mechanism for local correction without relying on external process reward models or full-sequence rejection. Experiments on multiple mathematical reasoning benchmarks demonstrate that $\text{E}^3\text{RL}$ consistently improves reasoning performance across both Qwen3-4B and Qwen3-8B models. The ablation studies further verify the importance of the proposed uncertainty metric and system-level erasure mechanisms, indicating that effective long-horizon reasoning benefits from both reliable local uncertainty estimation and structurally compatible policy optimization. Overall, $\text{E}^3\text{RL}$ offers a practical step toward more robust autoregressive reasoning systems with endogenous error-correction capabilities.

## References

## Appendix A Ethics Statement

This work studies reinforcement learning methods for improving the reasoning capabilities of large language models, with experiments conducted on mathematical and logical reasoning benchmarks. The proposed method does not involve human subjects, private user data, or personally identifiable information. All training and evaluation data used in this study are drawn from publicly available or standard research datasets. The primary goal of $\text{E}^3\text{RL}$ is to improve the reliability and sample efficiency of long-horizon reasoning by reducing the propagation of local reasoning errors. While stronger reasoning models may provide broad benefits in education, scientific problem solving, and automated assistance, they may also be adapted for unintended uses when deployed without appropriate safeguards. As with other methods that enhance LLM reasoning, practical deployment should be accompanied by standard safety measures, including content filtering, misuse monitoring, and human oversight in high-stakes applications. We also note that $\text{E}^3\text{RL}$ operates as a training and inference mechanism and does not introduce external reward models or additional sources of user data. Therefore, its ethical considerations are primarily aligned with those of general-purpose large language models. We encourage future applications of this framework to follow responsible AI practices, including transparent evaluation, careful domain-specific validation, and appropriate limitations on use in sensitive decision-making contexts.

## Appendix B Reproducibility

We take several steps to support the reproducibility of our experiments. First, we clearly specify the model backbones, training data, evaluation benchmarks, and metrics used throughout the paper. All experiments are conducted on the Qwen3-4B and Qwen3-8B architectures, trained with 51k samples selected from the DeepMath-103k dataset, and evaluated on AMC 2023, AIME 2024, AIME 2025, AIME 2026, MATH 500, Minerva, and OlympiadBench using Avg@32 and Pass@32. Second, the main algorithmic components of $\text{E}^3\text{RL}$ are described in Section [3](https://arxiv.org/html/2606.17735#S3), including segmented generation, epistemic entropy monitoring, adaptive thresholding, erasure control, and segment-level advantage assignment. The mathematical definitions of the uncertainty metric, dynamic threshold, erasure operator, and optimization objective are provided to make the training procedure implementable without relying on unspecified external modules. Third, all baseline methods are evaluated under the same model scales, training data, and benchmark protocols to ensure a fair comparison. The ablation studies in Section 4.3, together with the analysis in Section [5](https://arxiv.org/html/2606.17735#S5), further isolate the contribution of each major component, making it possible to verify the functional role of the proposed design choices. For full reproducibility, implementation details such as training hyperparameters, decoding configuration, segment length, smoothing window size, maximum erasure attempts, batch size, learning rate schedule, and hardware setup will be provided in the appendix. This will allow future work to reproduce the reported results and extend $\text{E}^3\text{RL}$ to other reasoning tasks and model scales.

## Appendix C LLM Usage

We used large language models (LLMs) only for non-scientific writing assistance: language polishing, clarity improvement, and suggestions. No part of the core methodology, experiments, or results was generated by LLMs.

## Appendix D Experimental Setup

Datasets and Benchmarks. We train all models on 51k samples selected from DeepMath-103k. To evaluate long-horizon mathematical reasoning ability, we report results on seven benchmarks: AMC 2023, AIME 2024, AIME 2025, AIME 2026, MATH 500, Minerva, and OlympiadBench. Following common practice in mathematical reasoning evaluation, we use Avg@32 and Pass@32 as the main metrics.
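For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them for a single problem. The paper does not spell out its estimator, so this is an assumption: Avg@k is taken as the mean correctness over k samples, and Pass@k uses the widely used unbiased combinatorial estimator $1 - \binom{n-c}{k} / \binom{n}{k}$ for $n$ samples of which $c$ are correct. The function names are ours.

```python
from math import comb


def avg_at_k(correct_flags):
    """Mean correctness over the sampled responses for one problem."""
    return sum(correct_flags) / len(correct_flags)


def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn from n is correct.

    n: total samples generated, c: number of correct samples, k: budget (k <= n).
    """
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so any draw of k contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)


flags = [1, 0, 1, 1]                 # 3 of 4 samples correct
print(avg_at_k(flags))
print(pass_at_k(n=4, c=1, k=1))
print(pass_at_k(n=32, c=1, k=32))
```

When $k = n$, the estimator reduces to checking whether any sample is correct; benchmark-level scores are then the average of these per-problem values.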

Models and Baselines. We conduct experiments on two model scales based on the Qwen3 architecture: Qwen3-4B and Qwen3-8B. We compare $\text{E}^3\text{RL}$ with several strong RL baselines, including Vanilla, GRPO, DAPO, GSPO, and SAPO. All methods are trained and evaluated under the same data split, model backbone, and evaluation protocols to ensure a fair comparison.

Training Configuration. All models are trained for 2 epochs with a batch size of 128. We use a learning rate of $1\times 10^{-6}$, a warmup ratio of 0.05, and a KL regularization coefficient of 0.04. The maximum context length is set to 13,312 tokens. For segmented generation, the segment length is set to 1,024 tokens and the maximum number of segments is set to 8. During sampling, we use temperature 1.0, top-$p = 0.9$, and top-$k = 50$. Each prompt is sampled with 8 generations during training.
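The settings above can be collected into a single configuration object, sketched below. The values are taken from this paragraph; the class name, field names, and helper properties are hypothetical and do not correspond to any released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as reported in the text; field names are illustrative."""
    epochs: int = 2
    batch_size: int = 128
    learning_rate: float = 1e-6
    warmup_ratio: float = 0.05
    kl_coef: float = 0.04
    max_context_tokens: int = 13_312
    segment_length: int = 1_024
    max_segments: int = 8
    temperature: float = 1.0
    top_p: float = 0.9
    top_k: int = 50
    generations_per_prompt: int = 8

    @property
    def max_generated_tokens(self) -> int:
        """Committed generation budget: segments times segment length."""
        return self.segment_length * self.max_segments

    @property
    def context_headroom(self) -> int:
        """Context tokens left over after the committed generation budget."""
        return self.max_context_tokens - self.max_generated_tokens


cfg = TrainConfig()
print(cfg.max_generated_tokens, cfg.context_headroom)
```

With these numbers the committed generation budget is 8,192 tokens, which leaves 5,120 tokens of the 13,312-token context; how that remainder is divided (for example, between the prompt and other overhead) is not stated in the text.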

Erasable Generation Configuration. For $\text{E}^3\text{RL}$, each reasoning trajectory is generated in a segmented manner. High-uncertainty segments are detected by the proposed epistemic entropy metric and can be erased and regenerated before being committed to the prefix. The maximum number of erasure attempts for each segment is set to 5. The frequency penalty is applied to discourage excessive repeated erasures, while the macro-level group dynamics adaptively calibrate the erasure threshold according to the uncertainty distribution of sampled trajectories.
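The control flow described here can be sketched as a simple loop. This is a toy under stated assumptions, not the authors' implementation: `sample_segment`, `score_segment`, and `threshold_fn` are hypothetical callables standing in for the policy, the epistemic entropy metric, and the adaptive threshold, and the list `prefix` stands in for the committed tokens whose KV cache would be reused. We also assume that once the erasure budget is exhausted the latest candidate is committed regardless of its score.

```python
import math
import random


def generate_with_erasure(sample_segment, score_segment, threshold_fn,
                          max_segments=8, max_erase=5):
    """Generate segment by segment, erasing high-uncertainty candidates.

    Returns the committed segments and the number of erasures used per segment.
    """
    prefix = []          # committed segments only; never modified by an erasure
    erase_counts = []
    for _ in range(max_segments):
        erased = 0
        while True:
            candidate = sample_segment(prefix)
            accept = score_segment(candidate) <= threshold_fn(erased)
            if accept or erased >= max_erase:
                break
            erased += 1  # discard the candidate and resample from the same prefix
        prefix.append(candidate)
        erase_counts.append(erased)
    return prefix, erase_counts


# Toy demo: a "segment" is just its own uncertainty score in [0, 1).
rng = random.Random(0)
segments, counts = generate_with_erasure(
    sample_segment=lambda prefix: rng.random(),
    score_segment=lambda seg: seg,
    threshold_fn=lambda erased: 0.3 * math.exp(0.5 * erased),  # loosens with retries
)
print(counts)
```

Only one trajectory is alive at any time: an erasure throws away the current candidate and resamples from the unchanged prefix, so no branching tree of alternatives is kept.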

Infrastructure and Evaluation\.Training is conducted on 32 NVIDIA H100 GPUs\. We adopt a distributed training–rollout setup, where rollout servers are used for efficient generation and synchronized with the training worker during RL optimization\. DeepSpeed ZeRO\-2 is used for distributed training\. We evaluate the model every 50 training steps and save checkpoints at the same interval\. The final reported results are computed using the same decoding and evaluation protocol across all compared methods\.

## Appendix E Complexity Analysis

We analyze the computational cost of $\text{E}^3\text{RL}$ and compare it with standard GRPO. Let $T$ denote the maximum generation length, $G$ the number of sampled responses per prompt, $d$ the hidden dimension, $M$ the number of Transformer layers, $|\mathcal{V}|$ the vocabulary size, $\ell$ the segment length, and $N=\lceil T/\ell\rceil$ the number of segments.

Complexity of GRPO. In standard GRPO, each prompt is associated with $G$ sampled trajectories. The dominant computation comes from autoregressive generation and policy-gradient optimization over the generated sequences. For a Transformer language model, the per-sequence computation can be written as

$$C_{\mathrm{LM}}(T)=\mathcal{O}\left(MT^{2}d+MTd^{2}\right), \tag{27}$$

where the first term corresponds to self-attention and the second term corresponds to feed-forward and projection operations. Therefore, the per-update complexity of GRPO is

$$C_{\mathrm{GRPO}}=\mathcal{O}\left(GMT^{2}d+GMTd^{2}\right). \tag{28}$$

When the hidden dimension is large relative to the sequence length (roughly $d > T$), the feed-forward term dominates, but we keep both terms for completeness.

Additional cost of $\text{E}^3\text{RL}$. Compared with GRPO, $\text{E}^3\text{RL}$ introduces three additional computational components. First, epistemic entropy is computed from the output distribution. Since logits are already produced during generation, this only requires an additional reduction over the vocabulary:

$$C_{\mathrm{ent}}=\mathcal{O}\left(GT|\mathcal{V}|\right). \tag{29}$$

Second, segment-level statistics, smoothing, and threshold comparison are computed once per segment:

$$C_{\mathrm{seg}}=\mathcal{O}\left(GNW\right), \tag{30}$$

where $W$ is the smoothing window size. Since $N=\lceil T/\ell\rceil$, this term is linear in the number of generated tokens for fixed $W$ and $\ell$.

Third, $\text{E}^3\text{RL}$ may erase and regenerate high-uncertainty segments. Let $e$ denote the expected total number of segment regenerations per sampled trajectory. The corresponding expected regenerated length is $\ell e$, and the relative regeneration ratio is

$$r=\frac{\ell e}{T}. \tag{31}$$

The extra language-model computation caused by erasure is therefore

$$C_{\mathrm{erase}}=\mathcal{O}\left(rGMT^{2}d+rGMTd^{2}\right). \tag{32}$$

Combining the above terms, the total expected complexity of $\text{E}^3\text{RL}$ is

$$
\begin{aligned}
C_{\text{E}^{3}\text{RL}}=\mathcal{O}\Big(&(1+r)GMT^{2}d+(1+r)GMTd^{2}\\
&+GT|\mathcal{V}|+GNW\Big).
\end{aligned}
\tag{33}
$$

Equivalently, relative to GRPO, we have

$$\frac{C_{\text{E}^{3}\text{RL}}}{C_{\mathrm{GRPO}}}=1+r+\mathcal{O}\left(\frac{|\mathcal{V}|+W/\ell}{M(Td+d^{2})}\right). \tag{34}$$

This shows that $\text{E}^3\text{RL}$ preserves the same asymptotic order as GRPO when the expected regeneration ratio $r$ is bounded. The additional entropy monitoring and thresholding operations are lightweight compared with the Transformer forward and backward computations, while the main extra cost comes from the regenerated segments. Since erasure is performed locally at the segment level rather than by discarding complete trajectories, the overhead scales with the amount of regenerated content instead of the full sequence length.
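To give a feel for the magnitudes in Eqs. (31) and (34), the sketch below plugs in illustrative numbers. The constants hidden by the $\mathcal{O}(\cdot)$ notation are ignored, so the result is an order-of-magnitude reading only. The function names and every numeric value in the example call (including $e = 2$, $W = 16$, and the model dimensions) are assumptions chosen for illustration, not measurements from the paper.

```python
def regeneration_ratio(seg_len, e, T):
    """r = (l * e) / T from Eq. (31): regenerated tokens relative to sequence length."""
    return seg_len * e / T


def relative_cost(r, T, d, M, vocab, W, seg_len):
    """Eq. (34) with all hidden constants set to 1 (an illustrative simplification)."""
    monitoring = (vocab + W / seg_len) / (M * (T * d + d * d))
    return 1.0 + r + monitoring


# Illustrative values only: 8 segments of 1,024 tokens, two regenerations per trajectory.
r = regeneration_ratio(seg_len=1024, e=2, T=8192)
ratio = relative_cost(r, T=8192, d=4096, M=36, vocab=151_936, W=16, seg_len=1024)
print(r, ratio)
```

Under these assumptions the monitoring term is several orders of magnitude smaller than $r$, which is the sense in which regeneration, not entropy bookkeeping, drives the overhead.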

Memory footprint. $\text{E}^3\text{RL}$ maintains only one active segmented trajectory for each sampled response. When a segment is erased, the method rolls back to the cached prefix state and regenerates the current segment without maintaining a branching search tree. Therefore, its rollout memory usage remains linear in the sequence length:

$$\mathcal{M}_{\text{E}^{3}\text{RL}}=\mathcal{O}\left(GMTd\right)+\mathcal{O}\left(GN\right), \tag{35}$$

where the first term corresponds to KV-cache storage and the second term corresponds to segment-level statistics. This contrasts with tree-based reasoning methods, whose memory may grow with the number of maintained branches. Thus, $\text{E}^3\text{RL}$ introduces local correction while retaining a linear memory footprint.

| Benchmark | Method | 4B k=1 | 4B k=8 | 4B k=16 | 4B k=32 | 4B k=64 | 4B k=128 | 4B k=256 | 8B k=1 | 8B k=8 | 8B k=16 | 8B k=32 | 8B k=64 | 8B k=128 | 8B k=256 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AMC 2023 | Vanilla | 0.700 | 0.875 | 0.900 | 0.950 | 0.950 | 0.950 | 0.950 | 0.675 | 0.825 | 0.850 | 0.875 | 0.900 | 0.900 | 0.925 |
| AMC 2023 | GRPO | 0.825 | 0.950 | 0.950 | 0.975 | 0.975 | 1.000 | 1.000 | 0.875 | 0.950 | 0.975 | 0.975 | 0.975 | 1.000 | 1.000 |
| AMC 2023 | DAPO | 0.875 | 0.950 | 0.950 | 0.975 | 0.975 | 0.975 | 1.000 | 0.925 | 0.950 | 0.950 | 0.950 | 0.975 | 0.975 | 1.000 |
| AMC 2023 | GSPO | 0.775 | 0.975 | 0.975 | 0.975 | 0.975 | 0.975 | 0.975 | 0.950 | 0.950 | 0.950 | 0.950 | 0.950 | 1.000 | 1.000 |
| AMC 2023 | SAPO | 0.900 | 0.950 | 0.975 | 0.975 | 1.000 | 1.000 | 1.000 | 0.900 | 0.925 | 0.975 | 0.975 | 0.975 | 0.975 | 0.975 |
| AMC 2023 | $\text{E}^3\text{RL}$ | 0.900 | 0.975 | 0.975 | 0.975 | 1.000 | 1.000 | 1.000 | 0.875 | 0.950 | 0.950 | 0.975 | 0.975 | 0.975 | 1.000 |
| AIME 2024 | Vanilla | 0.433 | 0.533 | 0.567 | 0.633 | 0.667 | 0.733 | 0.767 | 0.300 | 0.533 | 0.567 | 0.600 | 0.600 | 0.667 | 0.733 |
| AIME 2024 | GRPO | 0.500 | 0.800 | 0.800 | 0.833 | 0.833 | 0.833 | 0.867 | 0.567 | 0.700 | 0.767 | 0.767 | 0.800 | 0.833 | 0.900 |
| AIME 2024 | DAPO | 0.433 | 0.767 | 0.767 | 0.800 | 0.833 | 0.833 | 0.833 | 0.633 | 0.733 | 0.800 | 0.800 | 0.867 | 0.867 | 0.900 |
| AIME 2024 | GSPO | 0.500 | 0.800 | 0.833 | 0.833 | 0.833 | 0.833 | 0.867 | 0.567 | 0.800 | 0.800 | 0.833 | 0.833 | 0.867 | 0.900 |
| AIME 2024 | SAPO | 0.533 | 0.767 | 0.800 | 0.833 | 0.833 | 0.833 | 0.833 | 0.600 | 0.733 | 0.767 | 0.767 | 0.767 | 0.833 | 0.833 |
| AIME 2024 | $\text{E}^3\text{RL}$ | 0.533 | 0.767 | 0.800 | 0.833 | 0.833 | 0.867 | 0.867 | 0.600 | 0.667 | 0.733 | 0.767 | 0.833 | 0.867 | 0.900 |
| AIME 2025 | Vanilla | 0.233 | 0.400 | 0.467 | 0.533 | 0.533 | 0.533 | 0.567 | 0.233 | 0.400 | 0.433 | 0.467 | 0.467 | 0.467 | 0.500 |
| AIME 2025 | GRPO | 0.433 | 0.667 | 0.700 | 0.700 | 0.767 | 0.833 | 0.833 | 0.433 | 0.600 | 0.633 | 0.633 | 0.700 | 0.767 | 0.833 |
| AIME 2025 | DAPO | 0.367 | 0.567 | 0.700 | 0.833 | 0.833 | 0.867 | 0.867 | 0.433 | 0.567 | 0.600 | 0.667 | 0.700 | 0.800 | 0.867 |
| AIME 2025 | GSPO | 0.367 | 0.600 | 0.633 | 0.700 | 0.700 | 0.767 | 0.800 | 0.533 | 0.567 | 0.600 | 0.633 | 0.733 | 0.800 | 0.800 |
| AIME 2025 | SAPO | 0.467 | 0.633 | 0.667 | 0.667 | 0.700 | 0.733 | 0.833 | 0.367 | 0.467 | 0.500 | 0.600 | 0.667 | 0.800 | 0.867 |
| AIME 2025 | $\text{E}^3\text{RL}$ | 0.400 | 0.633 | 0.700 | 0.767 | 0.833 | 0.867 | 0.867 | 0.533 | 0.600 | 0.633 | 0.800 | 0.833 | 0.867 | 0.900 |
| AIME 2026 | Vanilla | 0.167 | 0.400 | 0.433 | 0.533 | 0.533 | 0.567 | 0.567 | 0.233 | 0.400 | 0.467 | 0.500 | 0.500 | 0.500 | 0.533 |
| AIME 2026 | GRPO | 0.367 | 0.667 | 0.667 | 0.667 | 0.800 | 0.800 | 0.867 | 0.467 | 0.700 | 0.733 | 0.733 | 0.733 | 0.733 | 0.733 |
| AIME 2026 | DAPO | 0.500 | 0.567 | 0.667 | 0.733 | 0.767 | 0.800 | 0.833 | 0.567 | 0.767 | 0.767 | 0.767 | 0.800 | 0.833 | 0.833 |
| AIME 2026 | GSPO | 0.267 | 0.567 | 0.633 | 0.667 | 0.767 | 0.800 | 0.800 | 0.533 | 0.700 | 0.733 | 0.733 | 0.767 | 0.767 | 0.767 |
| AIME 2026 | SAPO | 0.333 | 0.633 | 0.667 | 0.667 | 0.667 | 0.767 | 0.833 | 0.500 | 0.633 | 0.700 | 0.767 | 0.767 | 0.800 | 0.833 |
| AIME 2026 | $\text{E}^3\text{RL}$ | 0.433 | 0.667 | 0.733 | 0.733 | 0.767 | 0.767 | 0.833 | 0.500 | 0.667 | 0.700 | 0.733 | 0.800 | 0.800 | 0.833 |
| MATH 500 | Vanilla | 0.692 | 0.736 | 0.752 | 0.754 | 0.762 | 0.764 | 0.766 | 0.680 | 0.736 | 0.740 | 0.750 | 0.758 | 0.762 | 0.766 |
| MATH 500 | GRPO | 0.730 | 0.766 | 0.774 | 0.776 | 0.782 | 0.782 | 0.786 | 0.756 | 0.770 | 0.774 | 0.776 | 0.780 | 0.784 | 0.788 |
| MATH 500 | DAPO | 0.728 | 0.770 | 0.778 | 0.778 | 0.786 | 0.786 | 0.792 | 0.754 | 0.772 | 0.778 | 0.780 | 0.788 | 0.790 | 0.792 |
| MATH 500 | GSPO | 0.726 | 0.770 | 0.776 | 0.778 | 0.786 | 0.788 | 0.792 | 0.742 | 0.774 | 0.778 | 0.782 | 0.782 | 0.784 | 0.786 |
| MATH 500 | SAPO | 0.730 | 0.768 | 0.774 | 0.782 | 0.786 | 0.786 | 0.792 | 0.756 | 0.778 | 0.782 | 0.786 | 0.786 | 0.792 | 0.792 |
| MATH 500 | $\text{E}^3\text{RL}$ | 0.728 | 0.774 | 0.782 | 0.786 | 0.788 | 0.792 | 0.802 | 0.758 | 0.780 | 0.782 | 0.786 | 0.788 | 0.790 | 0.792 |
| Minerva | Vanilla | 0.217 | 0.265 | 0.279 | 0.298 | 0.316 | 0.323 | 0.327 | 0.184 | 0.261 | 0.279 | 0.286 | 0.298 | 0.313 | 0.327 |
| Minerva | GRPO | 0.232 | 0.305 | 0.320 | 0.335 | 0.342 | 0.349 | 0.360 | 0.246 | 0.301 | 0.320 | 0.346 | 0.353 | 0.368 | 0.390 |
| Minerva | DAPO | 0.224 | 0.305 | 0.320 | 0.335 | 0.338 | 0.346 | 0.360 | 0.209 | 0.301 | 0.324 | 0.331 | 0.349 | 0.371 | 0.379 |
| Minerva | GSPO | 0.254 | 0.287 | 0.313 | 0.327 | 0.338 | 0.353 | 0.364 | 0.228 | 0.313 | 0.331 | 0.346 | 0.349 | 0.379 | 0.382 |
| Minerva | SAPO | 0.250 | 0.305 | 0.324 | 0.346 | 0.349 | 0.360 | 0.364 | 0.224 | 0.305 | 0.342 | 0.351 | 0.371 | 0.382 | 0.382 |
| Minerva | $\text{E}^3\text{RL}$ | 0.235 | 0.309 | 0.320 | 0.331 | 0.335 | 0.368 | 0.382 | 0.254 | 0.309 | 0.327 | 0.357 | 0.378 | 0.390 | 0.393 |
| OlympiadBench | Vanilla | 0.495 | 0.588 | 0.613 | 0.633 | 0.655 | 0.670 | 0.684 | 0.474 | 0.575 | 0.596 | 0.619 | 0.639 | 0.659 | 0.674 |
| OlympiadBench | GRPO | 0.594 | 0.742 | 0.772 | 0.787 | 0.800 | 0.818 | 0.843 | 0.625 | 0.753 | 0.776 | 0.801 | 0.829 | 0.840 | 0.852 |
| OlympiadBench | DAPO | 0.607 | 0.744 | 0.764 | 0.799 | 0.818 | 0.837 | 0.851 | 0.643 | 0.764 | 0.788 | 0.805 | 0.833 | 0.838 | 0.849 |
| OlympiadBench | GSPO | 0.609 | 0.744 | 0.782 | 0.807 | 0.827 | 0.840 | 0.849 | 0.631 | 0.757 | 0.785 | 0.798 | 0.825 | 0.837 | 0.846 |
| OlympiadBench | SAPO | 0.612 | 0.747 | 0.776 | 0.797 | 0.813 | 0.825 | 0.849 | 0.637 | 0.757 | 0.784 | 0.805 | 0.821 | 0.833 | 0.847 |
| OlympiadBench | $\text{E}^3\text{RL}$ | 0.596 | 0.735 | 0.772 | 0.813 | 0.818 | 0.839 | 0.854 | 0.653 | 0.772 | 0.781 | 0.816 | 0.835 | 0.842 | 0.858 |

Table 6: Comprehensive Pass@k results for Qwen3-4B and Qwen3-8B across different benchmarks and values of $k$.

| Method (Qwen3-4B) | AMC 2023 | AIME 2024 | AIME 2025 | AIME 2026 | MATH 500 | Minerva | OlympiadBench |
|---|---|---|---|---|---|---|---|
| $\text{E}^3\text{RL}$ | 0.975 | 0.833 | 0.767 | 0.733 | 0.786 | 0.331 | 0.813 |
| w/o extremum deviation | 0.975 | - | 0.800 | 0.700 | 0.782 | 0.327 | 0.813 |
| w/o gradient anomaly | 0.950 | 0.800 | 0.733 | 0.667 | 0.784 | 0.331 | 0.807 |
| w/o base uncertainty | 0.950 | 0.767 | 0.767 | 0.700 | 0.776 | 0.335 | 0.804 |
| ow base uncertainty | 0.975 | 0.833 | 0.767 | 0.733 | 0.782 | 0.329 | 0.811 |
| ow gradient anomaly | 0.950 | 0.767 | 0.733 | 0.700 | 0.778 | 0.327 | 0.807 |
| ow extremum deviation | 0.950 | 0.767 | 0.733 | 0.733 | 0.780 | 0.329 | 0.793 |

Table 7: Ablation study on cognitive entropy components evaluated on the Pass@32 metric for Qwen3-4B.

| Method (Qwen3-4B) | AMC 2023 | AIME 2024 | AIME 2025 | AIME 2026 | MATH 500 | Minerva | OlympiadBench |
|---|---|---|---|---|---|---|---|
| $\text{E}^3\text{RL}$ | 0.975 | 0.833 | 0.767 | 0.733 | 0.786 | 0.331 | 0.813 |
| w/o frequency penalty | 0.950 | 0.833 | 0.800 | 0.733 | 0.784 | 0.335 | 0.809 |
| w/o causal allocation | 0.975 | 0.833 | 0.767 | 0.700 | 0.784 | 0.327 | 0.809 |
| w/o group dynamics | 0.975 | 0.800 | 0.767 | 0.667 | 0.778 | 0.331 | 0.805 |
| ow group dynamics | 0.975 | 0.800 | 0.767 | 0.667 | 0.782 | 0.327 | 0.805 |
| ow causal allocation | 0.975 | 0.833 | 0.800 | 0.700 | 0.782 | 0.324 | 0.801 |
| ow frequency penalty | 0.950 | 0.800 | 0.767 | 0.667 | 0.776 | 0.327 | 0.803 |

Table 8: Ablation study on system mechanisms evaluated on the Pass@32 metric for Qwen3-4B.
