ARCA: Adapter-Residual Credit Assignment When Token Signals Degenerate


Summary

This paper identifies a structural failure mode in token-level credit assignment for LLM reinforcement learning when using LoRA, where intrinsic signals degenerate. It proposes Adapter-Residual Credit Assignment (ARCA), which derives token salience from adapter hidden-state residuals and remains competitive with baselines.

arXiv:2606.00257v1 Announce Type: new Abstract: Token-level credit assignment for language-model reinforcement learning is usually formulated as if the policy were fully trainable, while practical LLM-RL pipelines often rely on parameter-efficient fine-tuning, especially LoRA. We argue that this separation hides a structural failure mode. Under LoRA, the policy is restricted to a low-rank neighborhood of the reference model, so the per-token output-distribution differences used by common intrinsic credit signals (surprisal, entropy reduction, and policy divergence) can become degenerate after within-trajectory normalization, either approaching uniform weights or concentrating on a small set of task-agnostic positions. We formalize this behavior and propose measuring it directly with concentration diagnostics such as weight Gini and effective-token ratio. We then introduce *Adapter-Residual Credit Assignment* (ARCA), a lightweight alternative that derives token salience from the adapter's own hidden-state residual, $\|h^{\text{adapted}}_t - h^{\text{base}}_t\|_2$. ARCA asks where the adapter actually changes the model, rather than where the output distribution appears uncertain or shifted, and requires no learned reward model, value head, or tree construction. In a compact MATH/Qwen3-1.7B GRPO sweep, ARCA exhibits the predicted non-degenerate middle-regime credit distribution under matched rollout budgets and remains competitive with rank-matched baselines.


# Adapter-Residual Credit Assignment When Token Signals Degenerate
Source: [https://arxiv.org/html/2606.00257](https://arxiv.org/html/2606.00257)
###### Abstract

Token-level credit assignment for language-model reinforcement learning is usually formulated as if the policy were fully trainable, while practical LLM-RL pipelines often rely on parameter-efficient fine-tuning, especially LoRA. We argue that this separation hides a structural failure mode. Under LoRA, the policy is restricted to a low-rank neighborhood of the reference model, so the per-token output-distribution differences used by common intrinsic credit signals (surprisal, entropy reduction, and policy divergence) can become degenerate after within-trajectory normalization, either approaching uniform weights or concentrating on a small set of task-agnostic positions. We formalize this behavior and propose measuring it directly with concentration diagnostics such as weight Gini and effective-token ratio. We then introduce *Adapter-Residual Credit Assignment* (ARCA), a lightweight alternative that derives token salience from the adapter’s own hidden-state residual, $\|h^{\text{adapted}}_{t}-h^{\text{base}}_{t}\|_{2}$. ARCA asks where the adapter actually changes the model, rather than where the output distribution appears uncertain or shifted, and requires no learned reward model, value head, or tree construction. In a compact MATH/Qwen3-1.7B GRPO sweep, ARCA exhibits the predicted non-degenerate middle-regime credit distribution under matched rollout budgets and remains competitive with rank-matched baselines.

Machine Learning, ICML

#### Code\.

## 1 Introduction

Reinforcement learning has become a central component of large language model (LLM) post-training, particularly for alignment and for reasoning-oriented tasks with verifiable rewards. A persistent challenge across these settings is *credit assignment*: trajectories are long, rewards are sparse and outcome-level, and it is not obvious how a single scalar signal at the end of a generation should be distributed across hundreds or thousands of token decisions. Recent work has responded with a rapidly growing toolbox of token-level credit-assignment methods, including entropy-aware modulation, process reward models, tree-based prefix values, and reward redistribution (Cui et al., [2025](https://arxiv.org/html/2606.00257#bib.bib21); Li et al., [2024a](https://arxiv.org/html/2606.00257#bib.bib19); Tran et al., [2025](https://arxiv.org/html/2606.00257#bib.bib15); Wang et al., [2025b](https://arxiv.org/html/2606.00257#bib.bib28); He et al., [2026](https://arxiv.org/html/2606.00257#bib.bib33); Yu et al., [2026](https://arxiv.org/html/2606.00257#bib.bib35); Kazemnejad et al., [2024](https://arxiv.org/html/2606.00257#bib.bib20)).

In parallel, many open-source and academic LLM-RL pipelines use parameter-efficient fine-tuning, especially LoRA (Hu et al., [2022](https://arxiv.org/html/2606.00257#bib.bib37)), because of memory and compute constraints; widely used frameworks such as TRL and verl support this workflow. Recent results show that LoRA + RL can be dramatically more sample- and parameter-efficient than full fine-tuning (Wang et al., [2025a](https://arxiv.org/html/2606.00257#bib.bib38)). Yet the two research threads are developed in near-total isolation. Papers on token-level credit assignment specify methods without reference to the adaptation strategy, and PEFT is treated as an orthogonal implementation detail.

In this work we argue that this separation is a mistake, and that the *interaction* between PEFT and credit assignment is itself a first-class methodological issue. The core issue is geometric. Intrinsic token-level weighting schemes derive per-token salience from quantities such as surprisal, entropy reduction, or divergence between the policy and a reference model. Under LoRA, however, the policy is constrained to a small low-rank neighborhood of the reference, and the per-token differences that these signals measure can lose the variation that made them useful. We formalize this as degeneration of the normalized salience distribution, characterize it with Gini coefficient and effective token count, and show that the same mechanism applies across multiple output-distribution-based weighting rules. This gives a mechanistic explanation for the empirically observed failure of full-token GRPO + LoRA (Lee and Tong, [2025](https://arxiv.org/html/2606.00257#bib.bib40)): the weighting signal itself is degraded before training ever begins.

Rather than trying to recover per-token structure from signals that LoRA intrinsically flattens, we measure salience directly from the adapter’s own contribution to the forward pass. Adapter-Residual Credit Assignment (ARCA) sets the salience at position $t$ equal to the norm of the adapter residual, $\|h^{\text{adapted}}_{t}-h^{\text{base}}_{t}\|_{2}$, computed via a forward pass with the adapter disabled. This signal is positive whenever the adapter is active, non-uniform across positions because adapter input activations are non-uniform, and requires no additional networks, learned reward models, or tree construction.

Concretely, this paper makes three contributions:

1. We identify and formalize a PEFT-specific failure mode of token credit assignment: under LoRA, output-distribution salience can degenerate into uniform broadcast or spurious sparsity, even when the underlying RL objective is unchanged.
2. We introduce ARCA, a lightweight adapter-residual credit assignment rule whose normalized token weights remain non-degenerate whenever the adapter has position-varying hidden-state impact.
3. We validate the mechanism with concentration diagnostics and a matched MATH/Qwen3-1.7B sweep, separating downstream performance from the more basic question of whether a proposed token-credit signal survives the adaptation regime actually used in LLM-RL.

The remainder of the paper develops these contributions. Section [2](https://arxiv.org/html/2606.00257#S2) positions the work relative to PEFT for language RL and token-level credit assignment. Section [3](https://arxiv.org/html/2606.00257#S3) presents the weighting schemes, explains why output-distribution signals degenerate under LoRA, and defines ARCA. Section [4](https://arxiv.org/html/2606.00257#S4) reports the diagnostic and performance comparison on MATH with Qwen3-1.7B in the reported seven-run sweep. Extended related work and theoretical interpretation appear in Appendices [A](https://arxiv.org/html/2606.00257#A1) and [C](https://arxiv.org/html/2606.00257#A3).

## 2 Related Work

### 2.1 PEFT and Low-Rank Adaptation in Language RL

Parameter-efficient fine-tuning, and in particular LoRA (Hu et al., [2022](https://arxiv.org/html/2606.00257#bib.bib37)), is a dominant adaptation strategy in open-source LLM-RL work. Tina demonstrated that tiny LoRA adapters on a 1.5B base model suffice to reach DeepSeek-R1-class reasoning behavior when combined with GRPO, at a tiny fraction of full fine-tuning cost (Wang et al., [2025a](https://arxiv.org/html/2606.00257#bib.bib38)). A broader systematic evaluation of PEFT methods under RLVR (Yin et al., [2025](https://arxiv.org/html/2606.00257#bib.bib39)) benchmarks over a dozen PEFT variants (including DoRA, AdaLoRA, MiSS, PiSSA, MiLoRA, VeRA and Rank-1) on DeepSeek-R1-Distill models and finds that standard LoRA is not optimal; structural variants such as DoRA, AdaLoRA and MiSS consistently outperform it, and SVD-initialized variants (PiSSA, MiLoRA) suffer from *spectral collapse*, a finding that is broadly compatible with our own signal-degeneration analysis. Token-Efficient RL introduces critic-free LoRA-compatible variants of GRPO (S-GRPO and T-SPMO) that focus training on a subset of tokens, reporting large gains on small models where full-token GRPO + LoRA trains unstably (Lee and Tong, [2025](https://arxiv.org/html/2606.00257#bib.bib40)); we now provide a mechanistic explanation for that instability via signal degeneration. *LoRA as an Implicit KL Regularizer* analyzes how LoRA restricts the policy to a rank-constrained neighborhood of the reference, deriving an explicit rank-dependent upper bound on the KL divergence between policy and reference throughout training (Anonymous, [2026](https://arxiv.org/html/2606.00257#bib.bib41)). Our Section [3.4](https://arxiv.org/html/2606.00257#S3.SS4) builds directly on this observation: an implicit KL bound is the same mechanism that forces per-token log-probability differentials to be small, which is the formal starting point for our degeneration result.

On the analysis side, *Narrow Fine-Tuning Traces* shows that narrow fine-tuning leaves clearly readable traces in the activation differences between base and fine-tuned hidden states, and that the fine-tuning domain can be recovered from those differences alone using simple diffing tools such as patchscopes and activation steering (Minder et al., [2025](https://arxiv.org/html/2606.00257#bib.bib43)). This is directly relevant to ARCA because the adapter residual $h^{\mathrm{adapted}}_{t}-h^{\mathrm{base}}_{t}$ is a per-position activation difference of exactly the kind they study; their results supply empirical evidence that this difference carries meaningful, semantically nontrivial content, the property our method exploits. Concurrent work on TopLoRA studies how to concentrate LoRA capacity on a small number of high-impact tokens via token-wise input-output projections (Li et al., [2025](https://arxiv.org/html/2606.00257#bib.bib44)). To our knowledge, no prior work has characterized the interaction between LoRA and token-level credit assignment, nor proposed an adaptation-aware intrinsic weighting scheme.

### 2.2 Positioning of This Work

The literature makes three points clear. First, critics are not strictly required for effective LLM post-training: critic-free estimators such as RLOO, ReMax, GRPO, and GSPO are already competitive in both RLHF and RLVR (Ahmadian et al., [2024](https://arxiv.org/html/2606.00257#bib.bib8); Li et al., [2024b](https://arxiv.org/html/2606.00257#bib.bib9); Shao et al., [2024](https://arxiv.org/html/2606.00257#bib.bib11); Zheng et al., [2025](https://arxiv.org/html/2606.00257#bib.bib25)). Second, many researchers have concluded that trajectory-level rewards are too coarse and have introduced denser supervision through process reward models, redistribution rules, tree structures, optimal baselines, temporal traces, or entropy-based modulation (Li et al., [2024a](https://arxiv.org/html/2606.00257#bib.bib19); Cui et al., [2025](https://arxiv.org/html/2606.00257#bib.bib21); Tran et al., [2025](https://arxiv.org/html/2606.00257#bib.bib15); Cao et al., [2025](https://arxiv.org/html/2606.00257#bib.bib13); Parthasarathi et al., [2025](https://arxiv.org/html/2606.00257#bib.bib22); Li et al., [2026](https://arxiv.org/html/2606.00257#bib.bib31); Hu et al., [2026](https://arxiv.org/html/2606.00257#bib.bib32); He et al., [2026](https://arxiv.org/html/2606.00257#bib.bib33); Yu et al., [2026](https://arxiv.org/html/2606.00257#bib.bib35); Meng et al., [2026](https://arxiv.org/html/2606.00257#bib.bib42)). Third, the overwhelming majority of practical LLM-RL pipelines are *trained with LoRA*, and a focused sub-literature has studied PEFT-specific effects in this regime (Hu et al., [2022](https://arxiv.org/html/2606.00257#bib.bib37); Wang et al., [2025a](https://arxiv.org/html/2606.00257#bib.bib38); Yin et al., [2025](https://arxiv.org/html/2606.00257#bib.bib39); Anonymous, [2026](https://arxiv.org/html/2606.00257#bib.bib41); Lee and Tong, [2025](https://arxiv.org/html/2606.00257#bib.bib40); Minder et al., [2025](https://arxiv.org/html/2606.00257#bib.bib43); Li et al., [2025](https://arxiv.org/html/2606.00257#bib.bib44)).

Our contribution is to bridge the second and third of these threads. Existing intrinsic credit-assignment signals (surprisal, entropy, divergence) are not novel as standalone objects, and we do not claim otherwise; they are our baselines. Our novel claim is that these signals *interact pathologically with the adaptation strategy that the field often uses*: under LoRA they can degenerate into uniform broadcast or spurious sparsity, so any paper that reports a null result for “more elaborate intrinsic weighting versus uniform” under LoRA may be observing a LoRA artifact, not evidence against token-level credit assignment. We formalize this, supply a unified diagnostic (Gini/EffN) that makes the degeneration visible at training time, and propose ARCA as an adaptation-aware alternative whose construction directly avoids the failure mode we identify.

## 3 Methods

We consider on-policy reinforcement learning for autoregressive language models with trajectory-level rewards. Let $\pi_{\theta}$ denote a language model parameterized by $\theta$, generating a completion $y=(y_{1},\dots,y_{T})$ conditioned on a prompt $x$. After sampling a full trajectory, the model receives a scalar reward $R(x,y)\in\mathbb{R}$ computed by an external verifier, such as exact-answer correctness or unit-test pass rate. Our objective is

$$J(\theta)=\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_{\theta}(\cdot\mid x)}[R(x,y)], \tag{1}$$

where $\mathcal{D}$ denotes the prompt distribution. We focus on the regime most relevant to recent reasoning-model training: sparse outcome rewards, on-policy sampling, and no learned value function.

### 3.1 Policy Gradient with Trajectory-Level Reward

For a fixed prompt $x$, the standard REINFORCE estimator is

$$g_{\mathrm{rf}}(x,y)=R(x,y)\sum_{t=1}^{T}\nabla_{\theta}\log\pi_{\theta}(y_{t}\mid y_{<t},x). \tag{2}$$

In practice, a baseline $b(x)$ is subtracted to reduce variance without changing the expected gradient as long as $b(x)$ does not depend on the sampled action at token $t$:

$$g_{\mathrm{base}}(x,y)=\left(R(x,y)-b(x)\right)\sum_{t=1}^{T}\nabla_{\theta}\log\pi_{\theta}(y_{t}\mid y_{<t},x). \tag{3}$$

We use prompt-level multi-sample baselines. Given $K$ sampled completions $\{y^{(i)}\}_{i=1}^{K}$ for the same prompt $x$ with rewards $\{R^{(i)}\}_{i=1}^{K}$,

$$b_{\mathrm{RLOO}}^{(i)}(x)=\frac{1}{K-1}\sum_{j\neq i}R^{(j)} \tag{4}$$

is the leave-one-out baseline used in RLOO-style estimators, while GRPO-style normalization can be written as

$$A_{\mathrm{GRPO}}^{(i)}(x)=\frac{R^{(i)}-\mu_{R}(x)}{\sigma_{R}(x)+\varepsilon}, \tag{5}$$

where $\mu_{R}(x)$ and $\sigma_{R}(x)$ are the within-prompt mean and standard deviation of the $K$ rewards. In both cases, the resulting scalar advantage is then broadcast uniformly over all tokens in the sampled trajectory. This scalar-broadcast assumption is precisely what we relax.
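The two prompt-level advantage estimators in equations (4) and (5) can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names, the value of `eps`, and the use of the population standard deviation are assumptions, since the text does not specify them.

```python
import statistics


def rloo_advantages(rewards):
    """Leave-one-out advantages: R_i minus the mean of the other K-1 rewards (eq. 4)."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two completions per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


def grpo_advantages(rewards, eps=1e-6):
    """Within-prompt standardized advantages (eq. 5).

    Assumption: sigma is the population standard deviation and eps=1e-6;
    neither choice is stated in the text.
    """
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical binary rewards for K=4 completions of one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
rloo = rloo_advantages(rewards)
grpo = grpo_advantages(rewards)
```

When every completion in a group receives the same reward, both estimators return zeros, so that prompt contributes no gradient regardless of how token weights are chosen.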

### 3.2 Token-Level Credit Redistribution

We introduce token-level weights $w_{t}(x,y)$ that redistribute a trajectory-level advantage across tokens:

$$g_{w}(x,y)=A(x,y)\sum_{t=1}^{T}w_{t}(x,y)\,\nabla_{\theta}\log\pi_{\theta}(y_{t}\mid y_{<t},x), \tag{6}$$

where $A(x,y)$ denotes either $R(x,y)-b_{\mathrm{RLOO}}(x)$ or $A_{\mathrm{GRPO}}(x,y)$ depending on the baseline choice.

The weights satisfy

$$\sum_{t=1}^{T}w_{t}(x,y)=1,\qquad w_{t}(x,y)\geq 0. \tag{7}$$

This length normalization keeps the total update magnitude comparable across weighting schemes and isolates the effect of credit redistribution from simple rescaling. It also means that the uniform baseline below is the length-normalized version of the usual token-sum estimator, rather than the unnormalized sum itself.

Uniform token credit corresponds to $w_{t}=1/T$ for all $t$. Our goal is to replace this uniform allocation with *intrinsic* weights derived from quantities already produced by the policy during generation. We do not train learned token-level reward models, estimate prefix values, or build explicit search trees.
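As a rough illustration of the weighted estimator in equation (6), the scalar below is the quantity whose gradient with respect to the policy parameters gives $g_{w}$ when the weights are held constant. The function name and the toy numbers are hypothetical; in an autodiff framework the log-probabilities would be tensors attached to the policy rather than plain floats.

```python
def weighted_surrogate(logprobs, weights, advantage, tol=1e-8):
    """Return A * sum_t w_t * log pi(y_t | y_<t, x) for one trajectory.

    The weights must be non-negative and sum to one, matching eq. (7).
    """
    if len(logprobs) != len(weights):
        raise ValueError("logprobs and weights must have the same length")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > tol:
        raise ValueError("weights must be non-negative and sum to 1")
    return advantage * sum(w * lp for w, lp in zip(weights, logprobs))


# Hypothetical per-token log-probabilities for a 4-token completion.
logprobs = [-1.0, -2.0, -3.0, -2.0]
uniform = [0.25, 0.25, 0.25, 0.25]
value = weighted_surrogate(logprobs, uniform, advantage=0.5)
```

With uniform weights this reduces to the advantage times the mean token log-probability, which is the length-normalized baseline described above.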

### 3.3 Intrinsic Token-Weighting Mechanisms

We define each method through an unnormalized salience score $\alpha_{t}(x,y)\geq 0$ and then normalize within each trajectory:

$$w_{t}(x,y)=\frac{\alpha_{t}(x,y)+\varepsilon}{\sum_{k=1}^{T}\left(\alpha_{k}(x,y)+\varepsilon\right)}, \tag{8}$$

with a small $\varepsilon>0$ to avoid degenerate all-zero cases. In implementation, the weights are treated as *detached* scalars when multiplying the policy-gradient term, so the update does not introduce second-order derivatives through the weighting function itself. This keeps the optimization rule close in spirit and cost to standard critic-free policy gradients. The floor $\varepsilon$ is part of the estimator: if all raw scores vanish relative to this floor, the normalized weights become uniform, while without such a floor nearly-zero scores can make the normalization ill-conditioned.
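The normalization in equation (8) is small enough to sketch directly. The function name and the default `eps` below are assumptions for illustration; the text only states that $\varepsilon$ is small and positive.

```python
def normalize_salience(alpha, eps=1e-6):
    """Map non-negative salience scores to weights that sum to one (eq. 8)."""
    if not alpha:
        raise ValueError("need at least one token")
    if any(a < 0 for a in alpha):
        raise ValueError("salience scores must be non-negative")
    shifted = [a + eps for a in alpha]
    total = sum(shifted)
    return [s / total for s in shifted]


# All-zero scores fall back to uniform weights because of the floor.
uniform_fallback = normalize_salience([0.0, 0.0, 0.0])
# Non-trivial scores keep their ordering after normalization.
weights = normalize_salience([0.5, 2.0, 1.0])
```

The all-zero case shows the role of the floor: without `eps` the denominator would be zero, and with it the weights reduce to $1/T$.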

#### Uniform Weighting \(Baseline\)\.

$$w_{t}^{\text{uniform}}=\frac{1}{T}. \tag{9}$$

Combined with an RLOO or GRPO baseline, this gives a length-normalized uniform-token baseline that assigns identical credit to every token in the sampled completion. It has the same token direction as the unnormalized critic-free token-sum estimator, but differs by the trajectory-length factor $1/T$.

#### Surprisal Weighting\.

Our first intrinsic score is token surprisal,

$$\alpha_{t}^{\text{surp}}(x,y)=-\log\pi_{\theta}(y_{t}\mid y_{<t},x). \tag{10}$$

This emphasizes low-probability decisions under the current policy. Intuitively, these are tokens where the model commits to a less routine continuation and where uniform credit assignment may be most diluted by predictable filler tokens.
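A minimal sketch of the surprisal score in equation (10), computed from per-step logits with a numerically stable log-softmax. The function names and the toy logits are hypothetical; `step_logits[t]` is assumed to hold the logits that predict token `t`.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def surprisal_scores(step_logits, token_ids):
    """Per-token surprisal -log pi(y_t | y_<t, x) for the sampled tokens."""
    if len(step_logits) != len(token_ids):
        raise ValueError("one logit vector is required per sampled token")
    return [-log_softmax(logits)[tok] for logits, tok in zip(step_logits, token_ids)]


# Toy 4-token vocabulary: a flat distribution, then a sharply peaked one.
step_logits = [[0.0, 0.0, 0.0, 0.0], [10.0, 0.0, 0.0, 0.0]]
scores = surprisal_scores(step_logits, token_ids=[2, 0])
```

The first position has surprisal $\log 4$ because all four tokens are equally likely, while the second is close to zero because the sampled token is the dominant one.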

#### Entropy\-Reduction Weighting\.

Our second intrinsic score measures local entropy reduction\. Let

$$H_{t}^{\mathrm{pre}}(x,y)=-\sum_{v}\pi_{\theta}(v\mid y_{<t},x)\log\pi_{\theta}(v\mid y_{<t},x) \tag{11}$$

denote the predictive entropy before sampling token $y_{t}$, and let

$$\Delta H_{t}^{\mathrm{ent}}(x,y)=\max\left(0,\,H_{t}^{\mathrm{pre}}(x,y)-H_{t+1}^{\mathrm{pre}}(x,y)\right) \tag{12}$$

for $t<T$. For the final token we set $\Delta H_{T}^{\mathrm{ent}}=0$. We then use

$$\alpha_{t}^{\text{ent}}(x,y)=\Delta H_{t}^{\mathrm{ent}}(x,y). \tag{13}$$

This score highlights tokens after which the model’s next-step distribution becomes substantially more concentrated. Such events often correspond to commitment points in reasoning, where one branch of continuation becomes much more likely than its alternatives.
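The entropy-reduction score in equations (11) to (13) can be sketched as follows. Names and toy logits are hypothetical, and `step_logits[t]` is again assumed to be the distribution from which token `t` was sampled.

```python
import math


def entropy(logits):
    """Shannon entropy (in nats) of softmax(logits), as in eq. (11)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_reduction_scores(step_logits):
    """Clipped drop in predictive entropy from step t to t+1 (eq. 12-13)."""
    if not step_logits:
        return []
    h = [entropy(logits) for logits in step_logits]
    scores = [max(0.0, h[t] - h[t + 1]) for t in range(len(h) - 1)]
    scores.append(0.0)  # the final token has no successor, so its score is zero
    return scores


# Flat -> peaked -> flat: only the first transition reduces entropy.
step_logits = [[0.0] * 4, [10.0, 0.0, 0.0, 0.0], [0.0] * 4]
scores = entropy_reduction_scores(step_logits)
```

Only the first position receives credit here: entropy falls after it, rises after the second position (clipped to zero), and the last position is zero by definition.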

#### Policy\-Divergence Weighting\.

A third intrinsic score measures how much the current policy has drifted from a reference $\pi_{\mathrm{ref}}$ (typically the base model prior to RL) at each token position:

$$\alpha_{t}^{\mathrm{div}}(x,y)=\lvert\log\pi_{\theta}(y_{t}\mid y_{<t},x)-\log\pi_{\mathrm{ref}}(y_{t}\mid y_{<t},x)\rvert. \tag{14}$$

In the RLHF/RLVR setting, $\pi_{\mathrm{ref}}$ is already available because it is used to compute the KL regularizer, so this score is essentially free.
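Given per-token log-probabilities of the sampled tokens under the policy and the reference, the divergence score in equation (14) is an elementwise absolute difference. The sketch below uses hypothetical names and values.

```python
def divergence_scores(policy_logprobs, ref_logprobs):
    """Absolute per-token log-probability gap between policy and reference (eq. 14)."""
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("policy and reference must cover the same tokens")
    return [abs(p - r) for p, r in zip(policy_logprobs, ref_logprobs)]


# Hypothetical log-probabilities of the same three sampled tokens.
policy_lp = [-1.0, -2.0, -0.25]
ref_lp = [-1.5, -2.0, -0.5]
scores = divergence_scores(policy_lp, ref_lp)
```

Positions where the policy and the reference agree exactly get a score of zero, which is why the $\varepsilon$ floor in equation (8) matters for this scheme.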

#### Length\-Robust Normalization\.

Because reasoning trajectories can vary substantially in length, all weights are normalized within each sampled completion rather than across the minibatch. This ensures that longer trajectories do not automatically receive larger aggregate updates simply because they contain more token positions with nonzero salience. The comparison between weighting schemes therefore reflects *where* credit is assigned within a trajectory, not how much total credit a longer sequence receives.

#### Batch\-Level Baselines\.

All weighting schemes can be paired with either RLOO or GRPO\-style prompt\-level baselines\. Unless otherwise specified, we use the same rollout groups, optimizer, sampling hyperparameters, and baseline family across methods\. This isolates token redistribution as the only algorithmic difference\.

### 3.4 Signal Degeneration Under Low-Rank Adaptation

So far our exposition has treated the policy $\pi_{\theta}$ as an unconstrained language model being updated by gradient descent. In practice, however, many LLM-RL pipelines apply LoRA (Hu et al., [2022](https://arxiv.org/html/2606.00257#bib.bib37)), which restricts $\pi_{\theta}$ to a rank-$r$ perturbation of a frozen reference policy $\pi_{\mathrm{ref}}$. We now show that the intrinsic weighting schemes defined above are qualitatively degraded by this constraint.

We use *collapse* to mean degeneration of the normalized token weights, not only convergence to the perfectly uniform distribution. Depending on the raw score scale, the $\varepsilon$ floor, and any sharpening transform, a degraded signal can appear either as uniform broadcast or as spuriously sparse credit concentrated on a tiny set of positions. In both cases, the weights no longer reflect where the adapter has learned to act.

#### Notation\.

Write the logits produced by the base model at position $t$ as $z_{t}^{\mathrm{ref}}=W_{\mathrm{lm}}h_{t}^{\mathrm{base}}$ and the logits produced by the adapted model as $z_{t}^{\theta}=W_{\mathrm{lm}}h_{t}^{\mathrm{adapted}}$, where $h_{t}^{\mathrm{adapted}}=h_{t}^{\mathrm{base}}+\Delta h_{t}$ and $\Delta h_{t}$ is the sum of LoRA adapter contributions propagated through the transformer to position $t$. Each adapter contributes $B_{\ell}A_{\ell}x_{\ell,t}$ at layer $\ell$, where $B_{\ell}\in\mathbb{R}^{d\times r}$ and $A_{\ell}\in\mathbb{R}^{r\times d}$, so the raw per-layer perturbation at each position lies in the fixed $r$-dimensional column space of $B_{\ell}$.

#### Degeneration of surprisal and entropy\.

Because $\|\Delta h_{t}\|$ is bounded by a product of adapter norms and input activation norms that are themselves roughly stationary across positions in a well-trained base model, the per-token surprisal under the adapted policy, expanded to first order in the perturbation,

$$-\log\pi_{\theta}(y_{t}\mid y_{<t},x)=-\log\pi_{\mathrm{ref}}(y_{t}\mid y_{<t},x)+O(\|\Delta h_{t}\|),$$

is dominated by the base model’s surprisal pattern rather than by the LoRA update. The same holds for predictive entropy. Consequently, when we normalize these scores within a trajectory (equation ([8](https://arxiv.org/html/2606.00257#S3.E8))), the resulting weights can be non-uniform, but their non-uniformity is not an adapter-aware credit signal; it mostly reflects base-model uncertainty or the numerics of the normalization.
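As a sketch of where the $O(\|\Delta h_{t}\|)$ term comes from: the expansion below is not stated in the source. It assumes the logit change is exactly $W_{\mathrm{lm}}\Delta h_{t}$, as in the notation above, and uses the standard gradient of log-softmax with respect to the logits. The symbols $p_{t}^{\mathrm{ref}}$ (the reference next-token probability vector) and $e_{y_{t}}$ (the one-hot vector of the sampled token) are introduced here for illustration only.

$$\log\pi_{\theta}(y_{t}\mid y_{<t},x)-\log\pi_{\mathrm{ref}}(y_{t}\mid y_{<t},x)=\left(e_{y_{t}}-p_{t}^{\mathrm{ref}}\right)^{\top}W_{\mathrm{lm}}\,\Delta h_{t}+O\!\left(\|\Delta h_{t}\|^{2}\right),$$

$$\left\lvert\left(e_{y_{t}}-p_{t}^{\mathrm{ref}}\right)^{\top}W_{\mathrm{lm}}\,\Delta h_{t}\right\rvert\leq\sqrt{2}\,\|W_{\mathrm{lm}}\|_{2}\,\|\Delta h_{t}\|_{2}.$$

The second line follows from Cauchy-Schwarz together with $\|e_{y_{t}}-p_{t}^{\mathrm{ref}}\|_{2}\leq\sqrt{2}$. Under these assumptions, the first-order change in surprisal at every position is controlled by $\|\Delta h_{t}\|_{2}$ times a constant that does not depend on $t$.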

#### Degeneration of policy divergence\.

Divergence weighting, $\alpha_{t}^{\mathrm{div}}=\lvert\log\pi_{\theta}(y_{t})-\log\pi_{\mathrm{ref}}(y_{t})\rvert$, is of direct interest because it quantifies exactly the policy change that the RL update is trying to drive. Under LoRA, however, this quantity is small and approximately uniform across positions for a different reason: the KL budget $\mathrm{KL}(\pi_{\theta}\|\pi_{\mathrm{ref}})$ is bounded by the rank-$r$ adapter’s capacity, so the per-token log-probability ratios are compressed toward zero (Anonymous, [2026](https://arxiv.org/html/2606.00257#bib.bib41)). Under the within-trajectory normalization in equation ([8](https://arxiv.org/html/2606.00257#S3.E8)), this compression is pathological. If the score is approximately constant and the $\varepsilon$ floor dominates, the weights approach uniform broadcast. If the floor is negligible, the same tiny differences can be amplified into a spurious sparse distribution.
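The two regimes described above can be reproduced with a toy calculation. The scores and the two `eps` values below are made up for illustration and are not taken from the paper's experiments; the normalization is the one in equation (8).

```python
def normalize(alpha, eps):
    """Within-trajectory normalization with an additive floor (eq. 8)."""
    shifted = [a + eps for a in alpha]
    total = sum(shifted)
    return [s / total for s in shifted]


# Hypothetical divergence scores after compression toward zero:
# all tiny, with one position slightly larger than the rest.
alpha = [1e-9, 2e-9, 1e-9, 5e-8]

# Floor much larger than the scores: weights are close to uniform broadcast.
floor_dominated = normalize(alpha, eps=1e-3)

# Floor much smaller than the scores: one position absorbs most of the credit.
floor_negligible = normalize(alpha, eps=1e-15)
```

The same raw scores give nearly uniform weights in the first case and place over 90% of the credit on one token in the second, even though the absolute differences between scores are on the order of $10^{-8}$.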

#### Consequence\.

We summarize the practical implication as follows. Let the Gini coefficient $G(w)$ and the effective-token ratio $\mathrm{EffN}(w)/T=1/(T\sum_{t}w_{t}^{2})$ be standard concentration measures, with $G(w)=0$ and $\mathrm{EffN}/T=1$ corresponding to perfectly uniform weights. For surprisal, entropy-reduction, and divergence weighting under LoRA, these metrics reveal whether the normalized weights have become degenerate: uniform broadcast gives $G(w)\approx 0$, while spuriously sparse credit gives $G(w)\approx 1$ and small $\mathrm{EffN}/T$. Both outcomes are failures of adapter-aware credit assignment. This is *not* a problem with the underlying signals in a fully trainable policy; it is a structural consequence of applying output-distribution signals inside a low-rank adaptation.
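Both diagnostics are cheap to compute from a weight vector. The sketch below uses the common sorted-rank form of the Gini coefficient, which is an assumption because the text does not say which Gini formula is used; the effective-token ratio follows the expression given above. Function names are hypothetical.

```python
def gini(weights):
    """Gini coefficient of non-negative weights (sorted-rank formula).

    Assumption: the standard mean-absolute-difference definition, under which
    a one-hot vector of length n scores (n - 1) / n and uniform weights score 0.
    """
    n = len(weights)
    total = sum(weights)
    if n == 0 or total <= 0:
        raise ValueError("weights must be non-empty with positive sum")
    ordered = sorted(weights)
    ranked = sum((i + 1) * w for i, w in enumerate(ordered))
    return 2.0 * ranked / (n * total) - (n + 1) / n


def effective_token_ratio(weights):
    """EffN(w) / T = 1 / (T * sum_t w_t^2)."""
    t = len(weights)
    return 1.0 / (t * sum(w * w for w in weights))


uniform = [0.25, 0.25, 0.25, 0.25]
one_hot = [0.0, 0.0, 0.0, 1.0]
```

For a trajectory of length $T$, a one-hot weight vector scores $(T-1)/T$ on this Gini definition and $1/T$ on the effective-token ratio, so both approach their extreme values as trajectories get longer.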

### 3.5 Adapter-Residual Credit Assignment (ARCA)

The degeneration result motivates a different design choice. Rather than measuring per-token properties of the *output distribution* (whose shape is dominated by the frozen base model under LoRA), we measure per-token properties of the *adapter’s own contribution* to the forward pass.

#### Definition\.

Given a model with a LoRA adapter, let $h_{t}^{\mathrm{adapted}}$ denote the last-layer hidden state at position $t$ with the adapter enabled, and $h_{t}^{\mathrm{base}}$ the same hidden state with the adapter disabled. Define the Adapter-Residual Credit Assignment salience as

$$\alpha_{t}^{\mathrm{ARCA}}(x,y)=\lVert h_{t}^{\mathrm{adapted}}-h_{t}^{\mathrm{base}}\rVert_{2}, \tag{15}$$

and normalize within the trajectory as in equation ([8](https://arxiv.org/html/2606.00257#S3.E8)) to obtain per-token weights $w_{t}^{\mathrm{ARCA}}$. The adapter residual is computed once per minibatch via an extra no-grad forward pass with the adapter disabled, using `model.disable_adapter()` in frameworks such as PEFT. This is exactly the same call already required to compute a KL regularizer against the reference policy or to support divergence weighting, so ARCA adds no new infrastructure.
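The arithmetic of equation (15) followed by the normalization of equation (8) can be sketched without any deep-learning dependency. In a real pipeline the two inputs would be last-layer hidden states from the adapted forward pass and from the adapter-disabled pass; here they are plain lists of floats with made-up values, and the function names and default `eps` are assumptions.

```python
import math


def arca_salience(h_adapted, h_base):
    """Per-position L2 norm of the hidden-state residual (eq. 15)."""
    if len(h_adapted) != len(h_base):
        raise ValueError("both passes must cover the same positions")
    return [math.dist(ha, hb) for ha, hb in zip(h_adapted, h_base)]


def arca_weights(h_adapted, h_base, eps=1e-6):
    """Normalize ARCA salience within one trajectory (eq. 8)."""
    alpha = arca_salience(h_adapted, h_base)
    shifted = [a + eps for a in alpha]
    total = sum(shifted)
    return [s / total for s in shifted]


# Toy hidden states: 3 positions, hidden size 2 (hypothetical values).
h_base = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
h_adapted = [[3.0, 4.0], [1.0, 1.0], [2.0, 1.0]]
alpha = arca_salience(h_adapted, h_base)
weights = arca_weights(h_adapted, h_base)
```

The middle position, where the two passes agree, receives only the floor; most of the credit goes to the first position, where the residual norm is largest.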

#### Why ARCA avoids output\-signal collapse\.

The key property that distinguishes ARCA from the intrinsic schemes above is that it measures a quantity whose *per-token* value is actively shaped by the attention and MLP pathways through which the adapter input activations are routed. Even for a low-rank adapter with small $\|B_{\ell}A_{\ell}\|$, the input activations $x_{\ell,t}$ are non-uniform across positions (attention patterns, layer norms, and content-versus-filler distinctions are already non-uniform in the base model), so $\|\Delta h_{t}\|$ retains nontrivial position-to-position variation. With a fixed $\varepsilon$ floor, any score whose magnitude vanishes completely will eventually become uniform after normalization. ARCA’s claim is therefore not that an infinitesimal adapter can defeat the floor; it is that, while the adapter residual is measurable, its normalized variation is tied to where the adapter actually changes the hidden state rather than to small output-distribution perturbations.

#### Interpretation as implicit gradient\-informed credit\.

ARCA can be read as a cheap proxy for a gradient\-based notion of token importance\. Positions at which the adapter contribution is large are positions at which the adapter’s parameter gradient has high contribution to the loss, and therefore positions where a policy\-gradient update can have the most effect\. This is a substantially weaker statement than the one made by a learned critic, but it has the advantage of being available for free from quantities the adapted forward pass already computes, and of being exactly targeted at the PEFT\-specific failure mode introduced above\.

## 4 Experiments

We evaluate whether adapter-residual credit assignment gives a useful token-level signal in the low-rank fine-tuning regime. The empirical section focuses on a single controlled sweep, summarized in Table [1](https://arxiv.org/html/2606.00257#S4.T1). It uses one model, one task, one advantage estimator, and seven weighting configurations. This design trades breadth for a cleaner test of the central claim: ARCA changes credit assignment without adding trainable parameters beyond the LoRA adapter used by the matched baselines.

### 4.1 Setup

#### Model and task\.

We fine-tune `Qwen/Qwen3-1.7B-Base` on the MATH training split and evaluate on held-out MATH examples. Prompts ask the model to solve the problem step by step and place the final answer in `\boxed{}`. Rewards are binary exact-match scores computed from the final boxed answer after normalization.
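A binary reward of this kind might look like the sketch below. The function names are hypothetical, and the normalization step (removing all whitespace) is a stand-in assumption, since the text does not describe the normalization actually used.

```python
def extract_last_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    out = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
    return None  # braces never balanced


def normalize_answer(answer):
    """Assumed normalization: drop all whitespace. The real rule is unspecified."""
    return "".join(answer.split())


def exact_match_reward(completion, reference):
    """1.0 if the final boxed answer matches the reference after normalization."""
    predicted = extract_last_boxed(completion)
    if predicted is None:
        return 0.0
    return 1.0 if normalize_answer(predicted) == normalize_answer(reference) else 0.0


reward = exact_match_reward("So the answer is \\boxed{\\frac{1}{2}}.", "\\frac{1}{2}")
```

Brace counting is used instead of a regular expression so that nested braces inside the boxed answer, such as fractions, are captured whole.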

#### Training\.

All runs use GRPO prompt-level advantages, LoRA adapters on the attention and MLP projections, and the same rollout budget, optimizer, learning-rate schedule, prompt format, and decoding parameters. Each training step samples $K=4$ completions per prompt. Each run performs 100 update steps and is evaluated on greedy accuracy and pass@4 over 200 held-out MATH examples. We use a single seed (1337) and save checkpoints every 20 steps. We use AdamW with learning rate $5\times 10^{-6}$, zero weight decay, a 10-step warmup followed by cosine decay, batch size 2, gradient accumulation over 4 microbatches, maximum prompt length 512, maximum generation length 512, temperature 1.0, and top-$p$ 0.95 sampling during training.
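The stated hyperparameters can be collected into a single configuration, together with one possible reading of the warmup-plus-cosine schedule. The dictionary keys and the `lr_at` helper are hypothetical names. The linear shape of the warmup and the decay to zero at the final step are assumptions, because the text gives only the warmup length and the word "cosine".

```python
import math

# Values taken from the setup description; key names are illustrative only.
CONFIG = {
    "seed": 1337,
    "completions_per_prompt": 4,
    "update_steps": 100,
    "checkpoint_every": 20,
    "eval_examples": 200,
    "optimizer": "AdamW",
    "learning_rate": 5e-6,
    "weight_decay": 0.0,
    "warmup_steps": 10,
    "batch_size": 2,
    "grad_accum_microbatches": 4,
    "max_prompt_length": 512,
    "max_generation_length": 512,
    "temperature": 1.0,
    "top_p": 0.95,
}


def lr_at(step, cfg=CONFIG, min_lr=0.0):
    """Learning rate at a 0-indexed update step.

    Assumptions: linear warmup over the first warmup_steps updates, then
    cosine decay from the peak to min_lr at the final update step.
    """
    peak = cfg["learning_rate"]
    warmup = cfg["warmup_steps"]
    total = cfg["update_steps"]
    if step < warmup:
        return peak * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    return min_lr + 0.5 * (peak - min_lr) * (1.0 + math.cos(math.pi * progress))


schedule = [lr_at(s) for s in range(CONFIG["update_steps"])]
```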

#### Methods\.

The sweep contains seven runs\. Four baselines use LoRA rank $r=64$: uniform token weighting, surprisal weighting, entropy\-reduction weighting, and policy\-divergence weighting\. The proposed method, ARCA, is run at $r\in\{4,16,64\}$\. The $r=64$ comparison tests ARCA against baselines with the same trainable parameter count\. The $r=4$ and $r=16$ comparisons test whether ARCA remains competitive with fewer adapter parameters than the $r=64$ baselines\.

Table 1: Reported sweep\. All runs use Qwen/Qwen3\-1\.7B\-Base, MATH, GRPO, one seed, and 100 training steps\. The non\-ARCA baselines are matched at LoRA rank 64; ARCA is evaluated at three ranks to separate credit assignment from adapter capacity\.

### 4\.2 Main Results on MATH

Table [2](https://arxiv.org/html/2606.00257#S4.T2) reports held\-out MATH performance after RL fine\-tuning\. The primary comparison is ARCA at rank 64 versus the rank\-64 baselines, which holds the trainable parameter count fixed\. The rank\-4 and rank\-16 ARCA runs provide a more stringent control: if lower\-rank ARCA is competitive with rank\-64 baselines, the gain cannot be attributed simply to using a larger adapter\.

Table 2: Held\-out MATH performance\. Greedy accuracy and pass@4 after GRPO fine\-tuning\. ARCA at rank 64 is parameter\-matched to the rank\-64 baselines; lower\-rank ARCA rows test the parameter\-count alternative\. Train reward is averaged over the 100 update steps\.

Table [2](https://arxiv.org/html/2606.00257#S4.T2) should be read as a compact validation rather than a broad benchmark\. Uniform weighting obtains the highest greedy accuracy, while entropy reduction obtains the highest pass@4\. ARCA at rank 64 is competitive with these baselines under the same trainable parameter count, trailing uniform greedy accuracy by three points and exceeding uniform pass@4 by 0\.5 points\. The lower\-rank ARCA variants do not outperform the rank\-64 baselines in this run, which is consistent with a capacity\-performance tradeoff rather than evidence that ARCA gains come from extra parameters\.

### 4\.3 Credit\-Assignment Diagnostics

Performance alone does not show whether a method changes token\-level credit assignment\. We therefore log the concentration of token weights during training\. We report the Gini coefficient $G(w)$ and effective\-token ratio $\mathrm{EffN}(w)/T$\. Uniform weighting has $G=0$ and $\mathrm{EffN}/T=1$; highly concentrated weights have $G$ near one and a small effective\-token ratio\.
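The two diagnostics can be sketched as follows. The formal definitions are given earlier in the paper and are not repeated in this section, so the forms used here are assumptions: the usual sorted-rank Gini formula for non-negative weights, and an inverse-participation effective count $(\sum_t w_t)^2 / \sum_t w_t^2$. Both reproduce the limiting values quoted above.

```python
def gini(weights):
    """Gini coefficient of non-negative token weights (0 = perfectly uniform)."""
    w = sorted(weights)
    n = len(w)
    total = sum(w)
    if n == 0 or total == 0:
        return 0.0
    ranked = sum((i + 1) * x for i, x in enumerate(w))
    return 2.0 * ranked / (n * total) - (n + 1) / n


def effective_token_ratio(weights):
    """EffN / T with EffN = (sum w)^2 / sum w^2 (assumed definition)."""
    total = sum(weights)
    sq = sum(x * x for x in weights)
    if sq == 0:
        return 0.0
    return (total * total) / sq / len(weights)


uniform = [1.0] * 8
one_hot = [0.0] * 7 + [1.0]
print(gini(uniform), effective_token_ratio(uniform))
print(gini(one_hot), effective_token_ratio(one_hot))
```

For a one-hot weight vector of length $T$, this Gini form gives $1 - 1/T$ and the effective-token ratio gives $1/T$, so both move toward their degenerate limits as sequences get longer.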

Table 3: Token\-weight diagnostics\. Concentration statistics averaged over training\. These diagnostics test whether a method produces a non\-uniform but non\-degenerate token\-credit signal, the empirical observable predicted by the degeneration analysis\.

Table [3](https://arxiv.org/html/2606.00257#S4.T3) is the central diagnostic result\. Uniform weighting is exactly flat\. Surprisal, entropy reduction, and policy divergence produce extremely concentrated distributions, with effective\-token ratios between 0\.052 and 0\.073 on average\. These methods therefore do not collapse to uniform in this implementation; they degenerate in the other direction, into spuriously sparse credit\. ARCA occupies the predicted middle regime: it is far from uniform, but it does not concentrate credit onto the tiny effective support used by the output\-distribution baselines\. This is the empirical signature of the theoretical distinction\. Output\-distribution scores ask where the model is uncertain or shifted; ARCA asks where the adapter actually acts\.

### 4\.4 Results and Discussion

The results support a theory\-first interpretation\. The downstream accuracy numbers do not show a universal ARCA win: uniform is best under greedy decoding, and entropy reduction is best under pass@4\. However, ARCA at rank 64 is competitive under matched parameters, and its token\-credit distribution is qualitatively different from every baseline\. In this paper, that diagnostic is not secondary\. It is the observable predicted by the LoRA\-degeneration analysis: a method whose salience is measured through output distributions can broadcast credit uniformly or concentrate it on a very small set of positions, while adapter\-residual salience remains position\-discriminative without becoming maximally sparse\.

This distinction matters because token\-level credit assignment is often evaluated only by final task accuracy, which cannot tell a flat credit distribution from a spuriously sparse or a position\-discriminative one\. The concentration diagnostics separate these cases\. ARCA demonstrates that one can obtain an adaptation\-aware credit signal from the LoRA adapter itself, with no value head, learned process reward model, or tree construction, while keeping the trainable parameter count identical to the rank\-64 baselines\.

The rank controls provide a useful boundary on the claim\. ARCA at ranks 4 and 16 uses substantially fewer trainable parameters than the rank\-64 baselines, and in this sweep those lower\-rank variants do not match rank\-64 performance\. We therefore do not claim that ARCA removes the need for sufficient adapter capacity\. The supported claim is sharper: at a fixed adapter rank, ARCA changes the geometry of token credit in the way predicted by the theory, and does so while remaining competitive on held\-out MATH\.

### 4\.5 Implementation and Metrics

For each minibatch, training samples completions from the current policy, computes binary rewards after full generation, constructs GRPO advantages, and applies the weighted token\-level objective from Section [3](https://arxiv.org/html/2606.00257#S3)\. Surprisal uses current\-policy token log probabilities, entropy\-reduction uses next\-token entropy drops, policy\-divergence compares adapted and adapter\-disabled log probabilities, and ARCA weights tokens by adapter\-residual norms in the final hidden layer\.
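A minimal sketch of the ARCA weighting step, using plain Python lists in place of tensors. The residual norm $\|h^{\text{adapted}}_t - h^{\text{base}}_t\|_2$ follows the paper's definition; the mean-one within-trajectory normalization, the uniform fallback, and the form of the weighted loss are assumptions made for illustration, since the exact objective is defined in Section 3 rather than here.

```python
import math


def residual_norms(h_adapted, h_base):
    """Per-token L2 norm of (adapted - base) final-layer hidden states."""
    return [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(ha, hb)))
        for ha, hb in zip(h_adapted, h_base)
    ]


def normalize_weights(scores, eps=1e-8):
    """Scale scores so they average to 1 over the trajectory (assumed scheme)."""
    total = sum(scores)
    n = len(scores)
    if total <= eps:
        return [1.0] * n  # adapter inactive everywhere: fall back to uniform
    return [n * s / total for s in scores]


def weighted_pg_loss(logprobs, weights, advantage):
    """Token-weighted policy-gradient surrogate for one completion."""
    n = len(logprobs)
    return -advantage * sum(w * lp for w, lp in zip(weights, logprobs)) / n


# Toy example: 3 tokens, hidden size 2. The base states come from a forward
# pass with the adapter disabled, the adapted states from the normal pass.
h_base = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
h_adapted = [[3.0, 4.0], [0.0, 0.0], [0.0, 5.0]]
weights = normalize_weights(residual_norms(h_adapted, h_base))
print(weights)
print(weighted_pg_loss([-1.0, -2.0, -3.0], weights, advantage=2.0))
```

In a real training loop the weights would be treated as constants (no gradient flows through them), so that only the log-probabilities carry the update.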

The code records per\-step reward mean, reward variance, loss, completion length, token\-weight Gini, effective\-token ratio, and, for ARCA, the mean and maximum adapter\-residual norm\. Final evaluation records greedy accuracy and pass@4 on held\-out MATH examples\. These metrics allow us to distinguish three claims: whether the method trained, whether it changed token\-credit structure, and whether that structure improved held\-out performance\.

### 4\.6 Compute

Each run fits on a single H100\-class GPU\. The seven submission\-time runs are executed independently, one configuration per GPU, with wall\-clock training time on the order of a few hours per run plus final evaluation\. The reported sweep therefore requires seven single\-GPU runs; preliminary development runs were used to tune runtime and checkpointing but are not included in the reported comparison\.

### 4\.7 Broader Impacts

This work is methodological and studies how to assign token\-level credit during reinforcement learning for language models\. The potential positive impact is improved sample efficiency and interpretability of RL fine\-tuning, especially in settings where practitioners already use parameter\-efficient adaptation\. The same techniques could also improve the fine\-tuning of models for harmful or misleading generation if applied without appropriate task, data, and deployment safeguards\. We do not release a new general\-purpose model in this paper\.

### 4\.8 Limitations

This sweep is deliberately small: one model, one dataset, one seed, one advantage estimator, and seven configurations\. It is sufficient to test the parameter\-matched ARCA comparison and the low\-rank control, but with a single seed the reported numbers are descriptive and do not support claims of statistical significance\. We view the results as a targeted validation of the mechanism rather than a broad benchmark claim\.

## 5 Conclusion

Token\-level credit assignment and parameter\-efficient fine\-tuning are usually studied as separate design choices\. This paper argues that they are coupled\. Under LoRA, output\-distribution signals such as surprisal, entropy reduction, and policy divergence can degenerate into token weights that no longer track where the adapter acts\. ARCA addresses this failure mode by measuring salience where LoRA actually acts: in the adapter\-induced hidden\-state residual\. The resulting signal is lightweight, requires no learned critic or learned process reward model, and produces non\-degenerate token\-credit distributions in the regime where output\-distribution baselines become either flat or highly concentrated\. Our MATH/Qwen3\-1\.7B sweep is intentionally compact, but it supports the central mechanism: adaptation\-aware credit assignment changes the geometry of token updates under matched rollout and parameter budgets\.

## Acknowledgements

We thank Scale AI for providing the compute resources used in this work\.

## References

- A\. Ahmadian, C\. Cremer, M\. Gallé, M\. Fadaee, J\. Kreutzer, O\. Pietquin, A\. Üstün, and S\. Hooker \(2024\)\. Back to basics: revisiting REINFORCE\-style optimization for learning from human feedback in LLMs\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)\.
- Anonymous \(2026\)\. LoRA as an implicit KL regularizer in GRPO fine\-tuning: from theory to practice\. Note: Under review at Transactions on Machine Learning Research\.
- M\. Cao, S\. Zhang, X\. Chang, and D\. Precup \(2025\)\. SCAR: Shapley credit assignment for more efficient RLHF\. External Links: 2505\.20417\.
- Y\. Chai, H\. Sun, H\. Fang, S\. Wang, Y\. Sun, and H\. Wu \(2025\)\. MA\-RLHF: reinforcement learning from human feedback with macro actions\. In The Thirteenth International Conference on Learning Representations\.
- H\. Chen, T\. Yang, S\. Gao, R\. Chen, X\. Quan, H\. Tian, and T\. Yao \(2025\)\. Discriminative policy optimization for token\-level reward models\. In Proceedings of the 42nd International Conference on Machine Learning, A\. Singh, M\. Fazel, D\. Hsu, S\. Lacoste\-Julien, F\. Berkenkamp, T\. Maharaj, K\. Wagstaff, and J\. Zhu \(Eds\.\), Proceedings of Machine Learning Research, Vol\. 267, pp\. 9546–9565\.
- P\. F\. Christiano, J\. Leike, T\. B\. Brown, M\. Martic, S\. Legg, and D\. Amodei \(2017\)\. Deep reinforcement learning from human preferences\. In Advances in Neural Information Processing Systems 30\.
- G\. Cui, L\. Yuan, Z\. Wang, H\. Wang, Y\. Zhang, J\. Chen, W\. Li, B\. He, Y\. Fan, T\. Yu, Q\. Xu, W\. Chen, J\. Yuan, H\. Chen, K\. Zhang, X\. Lv, S\. Wang, Y\. Yao, X\. Han, H\. Peng, Y\. Cheng, Z\. Liu, M\. Sun, B\. Zhou, and N\. Ding \(2025\)\. Process reinforcement through implicit rewards\. External Links: 2502\.01456\.
- DeepSeek\-AI \(2025\)\. DeepSeek\-R1: incentivizing reasoning capability in LLMs via reinforcement learning\. External Links: 2501\.12948\.
- Y\. He, H\. Wu, S\. Liu, H\. Ge, H\. Zhou, K\. Wu, Z\. Zheng, Q\. Lin, Z\. Zhong, and Y\. Zhang \(2026\)\. Rethinking token\-level credit assignment in RLVR: a polarity\-entropy analysis\. External Links: 2604\.11056\.
- F\. Helm, N\. Daheim, and I\. Gurevych \(2025\)\. Token weighting for long\-range language modeling\. In Findings of the Association for Computational Linguistics: NAACL 2025, L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\), Albuquerque, New Mexico, pp\. 1440–1459\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2022\)\. LoRA: low\-rank adaptation of large language models\. In International Conference on Learning Representations\.
- J\. Hu, J\. K\. Liu, H\. Xu, and W\. Shen \(2025\)\. REINFORCE\+\+: stabilizing critic\-free policy optimization with global advantage normalization\. External Links: 2501\.03262\.
- M\. Hu, B\. Wang, S\. Hu, R\. Wang, X\. Wang, X\. Guo, D\. Zha, and J\. Xiao \(2026\)\. PSPO: trainable potential\-based reward shaping with internal model signals for post\-training policy optimization of large language models\. Note: ICLR 2026 submission, rejected\.
- A\. Kazemnejad, M\. Aghajohari, E\. Portelance, A\. Sordoni, S\. Reddy, A\. Courville, and N\. Le Roux \(2024\)\. VinePPO: refining credit assignment in RL training of LLMs\. External Links: 2410\.01679\.
- A\. Lee and H\. Tong \(2025\)\. Token\-efficient RL for LLM reasoning\. External Links: 2504\.20834\.
- J\. Li, L\. Li, T\. Chang, K\. Kuang, L\. Chen, J\. Zhou, and C\. Yang \(2024a\)\. RED: unleashing token\-level rewards from holistic feedback via reward redistribution\. External Links: 2411\.08302\.
- S\. Li, X\. Luo, H\. Wang, X\. Tang, Z\. Cui, D\. Liu, Y\. Li, X\. He, and R\. Li \(2025\)\. Beyond higher rank: token\-wise input\-output projections for efficient low\-rank adaptation\. External Links: 2510\.23123\.
- Y\. Li, J\. Xu, Z\. Li, J\. Liu, W\. Liu, Y\. Tong, L\. Zheng, Z\. Xue, Y\. Zhang, T\. Cai, G\. Zhang, Q\. Liu, and B\. Wang \(2026\)\. The optimal token baseline: variance reduction for long\-horizon LLM\-RL\. External Links: 2602\.07078\.
- Z\. Li, T\. Xu, Y\. Zhang, Z\. Lin, Y\. Yu, R\. Sun, and Z\. Luo \(2024b\)\. ReMax: a simple, effective, and efficient reinforcement learning method for aligning large language models\. In Proceedings of the 41st International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol\. 235\.
- H\. Meng, K\. Huang, S\. Wei, C\. Ma, S\. Yang, X\. Wang, G\. Wang, B\. Ding, and J\. Zhou \(2026\)\. Sparse but critical: a token\-level analysis of distributional shifts in RLVR fine\-tuning of LLMs\. External Links: 2603\.22446\.
- J\. Minder, C\. Dumas, S\. Slocum, H\. Casademunt, C\. Holmes, R\. West, and N\. Nanda \(2025\)\. Narrow fine\-tuning leaves clearly readable traces in activation differences\. External Links: 2510\.13900\.
- Y\. Mroueh, N\. Dupuis, B\. Belgodere, A\. Nitsure, M\. Rigotti, K\. Greenewald, J\. Navratil, J\. Ross, and J\. Rios \(2025\)\. Revisiting group relative policy optimization: insights into on\-policy and off\-policy training\. External Links: 2505\.22257\.
- L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. L\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray, J\. Schulman, J\. Hilton, F\. Kelton, L\. Miller, M\. Simens, A\. Askell, P\. Welinder, P\. Christiano, J\. Leike, and R\. Lowe \(2022\)\. Training language models to follow instructions with human feedback\. In Advances in Neural Information Processing Systems 35\.
- P\. Parthasarathi, M\. Reymond, B\. Chen, Y\. Cui, and S\. Chandar \(2025\)\. GRPO\-$\lambda$: credit assignment improves LLM reasoning\. External Links: 2510\.00194\.
- J\. Schulman, S\. Levine, P\. Abbeel, M\. Jordan, and P\. Moritz \(2015\)\. Trust region policy optimization\. In Proceedings of the 32nd International Conference on Machine Learning, F\. Bach and D\. Blei \(Eds\.\), Proceedings of Machine Learning Research, Vol\. 37, Lille, France, pp\. 1889–1897\.
- J\. Schulman, P\. Moritz, S\. Levine, M\. Jordan, and P\. Abbeel \(2016\)\. High\-dimensional continuous control using generalized advantage estimation\. In International Conference on Learning Representations\.
- J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov \(2017\)\. Proximal policy optimization algorithms\. External Links: 1707\.06347\.
- Z\. Shan, H\. Zhong, L\. Wang, and L\. Zhao \(2026\)\. Bringing value models back: generative critics for value modeling in LLM reinforcement learning\. External Links: 2604\.10701\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. K\. Li, Y\. Wu, and D\. Guo \(2024\)\. DeepSeekMath: pushing the limits of mathematical reasoning in open language models\. External Links: 2402\.03300\.
- L\. S\. Shapley \(1953\)\. A value for n\-person games\. In Contributions to the Theory of Games II, H\. W\. Kuhn and A\. W\. Tucker \(Eds\.\), pp\. 307–317\.
- N\. Stiennon, L\. Ouyang, J\. Wu, D\. M\. Ziegler, R\. Lowe, C\. Voss, A\. Radford, D\. Amodei, and P\. F\. Christiano \(2020\)\. Learning to summarize from human feedback\. In Advances in Neural Information Processing Systems 33\.
- H\. Tran, Z\. Yao, and H\. Yu \(2025\)\. Exploiting tree structure for credit assignment in RL training of LLMs\. External Links: 2509\.18314\.
- S\. Wang, J\. Asilis, Ö\. F\. Akgül, E\. B\. Bilgin, O\. Liu, and W\. Neiswanger \(2025a\)\. Tina: tiny reasoning models via LoRA\. External Links: 2504\.15777\.
- S\. Wang, L\. Yu, C\. Gao, C\. Zheng, S\. Liu, R\. Lu, K\. Dang, X\. Chen, J\. Yang, Z\. Zhang, Y\. Liu, A\. Yang, A\. Zhao, Y\. Yue, S\. Song, B\. Yu, G\. Huang, and J\. Lin \(2025b\)\. Beyond the 80/20 rule: high\-entropy minority tokens drive effective reinforcement learning for LLM reasoning\. External Links: 2506\.01939\.
- X\. Wen, Z\. Liu, S\. Zheng, S\. Ye, Z\. Wu, Y\. Wang, Z\. Xu, X\. Liang, J\. Li, Z\. Miao, J\. Bian, and M\. Yang \(2025\)\. Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base LLMs\. External Links: 2506\.14245\.
- R\. J\. Williams \(1992\)\. Simple statistical gradient\-following algorithms for connectionist reinforcement learning\. Machine Learning 8, pp\. 229–256\.
- W\. Xiong, J\. Yao, Y\. Xu, B\. Pang, L\. Wang, D\. Sahoo, J\. Li, N\. Jiang, T\. Zhang, C\. Xiong, and H\. Dong \(2025\)\. A minimalist approach to LLM reasoning: from rejection sampling to reinforce\. External Links: 2504\.11343\.
- S\. Yang, S\. Zhang, C\. Xia, Y\. Feng, C\. Xiong, and M\. Zhou \(2023\)\. Preference\-grounded token\-level guidance for language model fine\-tuning\. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10–16, 2023, A\. Oh, T\. Naumann, A\. Globerson, K\. Saenko, M\. Hardt, and S\. Levine \(Eds\.\)\.
- Q\. Yin, Y\. Wu, Z\. Shen, S\. Li, Z\. Wang, Y\. Li, C\. T\. Leong, J\. Kang, and J\. Gu \(2025\)\. Evaluating parameter efficient methods for RLVR\. External Links: 2512\.23165\.
- H\. S\. Yoon, E\. Yoon, M\. A\. Hasegawa\-Johnson, S\. Kim, and C\. D\. Yoo \(2025\)\. ConfPO: exploiting policy model confidence for critical token selection in preference optimization\. In Proceedings of the 42nd International Conference on Machine Learning, A\. Singh, M\. Fazel, D\. Hsu, S\. Lacoste\-Julien, F\. Berkenkamp, T\. Maharaj, K\. Wagstaff, and J\. Zhu \(Eds\.\), Proceedings of Machine Learning Research, Vol\. 267, pp\. 72641–72655\.
- S\. Yu, L\. Li, W\. Zhao, and Z\. Yang \(2026\)\. ERPO: token\-level entropy\-regulated policy optimization for large reasoning models\. External Links: 2603\.28204\.
- C\. Zhang, Y\. Deng, X\. Lin, B\. Wang, D\. Ng, H\. Ye, X\. Li, Y\. Xiao, Z\. Mo, Q\. Zhang, and L\. Bing \(2025\)\. 100 days after DeepSeek\-R1: a survey on replication studies and more directions for reasoning language models\. External Links: 2505\.00551\.
- C\. Zheng, S\. Liu, M\. Li, X\. Chen, B\. Yu, C\. Gao, K\. Dang, Y\. Liu, R\. Men, A\. Yang, J\. Zhou, and J\. Lin \(2025\)\. Group sequence policy optimization\. External Links: 2507\.18071\.
- H\. Zhong, Z\. Shan, G\. Feng, W\. Xiong, X\. Cheng, L\. Zhao, D\. He, J\. Bian, and L\. Wang \(2025\)\. DPO meets PPO: reinforced token optimization for RLHF\. In Proceedings of the 42nd International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol\. 267, pp\. 78498–78521\.

## Appendix A Extended Related Work

### A\.1 From RLHF to RLVR

Modern language\-model post\-training inherits its optimization machinery from policy\-gradient reinforcement learning, from REINFORCE \(Williams, [1992](https://arxiv.org/html/2606.00257#bib.bib1)\) to trust\-region and clipped\-surrogate methods such as TRPO and PPO \(Schulman et al\., [2015](https://arxiv.org/html/2606.00257#bib.bib2), [2017](https://arxiv.org/html/2606.00257#bib.bib4)\)\. Advantage estimation, especially GAE, became the standard variance\-reduction device in continuous\-control RL by learning value functions over states or prefixes \(Schulman et al\., [2016](https://arxiv.org/html/2606.00257#bib.bib3)\)\. In language modeling, these ideas entered mainstream use through RLHF systems that optimize sequence\-level rewards derived from human preferences or reward models \(Christiano et al\., [2017](https://arxiv.org/html/2606.00257#bib.bib5); Stiennon et al\., [2020](https://arxiv.org/html/2606.00257#bib.bib6); Ouyang et al\., [2022](https://arxiv.org/html/2606.00257#bib.bib7)\)\.

The recent reasoning\-model wave shifted part of the field from preference\-based RLHF toward reinforcement learning with verifiable rewards \(RLVR\), where supervision is often a deterministic correctness signal on math or code tasks\. DeepSeekMath introduced GRPO as a practical critic\-free alternative for these settings \(Shao et al\., [2024](https://arxiv.org/html/2606.00257#bib.bib11)\), and DeepSeek\-R1 popularized large\-scale RL\-only or RL\-dominant reasoning pipelines \(DeepSeek\-AI, [2025](https://arxiv.org/html/2606.00257#bib.bib23)\)\. Subsequent surveys document how quickly this regime became the default template for open reasoning\-model replication and extension \(Zhang et al\., [2025](https://arxiv.org/html/2606.00257#bib.bib24)\)\. This transition matters for our setting because RLVR makes sparse outcome rewards and long reasoning trajectories central, thereby exposing the credit\-assignment problem more directly than earlier short\-form RLHF tasks\.

### A\.2 Critic\-Based and Critic\-Free Policy Optimization

The classical solution to delayed reward is to learn a critic or value function and convert returns into token\- or prefix\-level advantages\. PPO\-style RLHF pipelines follow this recipe, but the value\-estimation problem is particularly awkward for text generation because prefixes are high\-dimensional, non\-Markov, and semantically heterogeneous\. VinePPO demonstrated that standard value heads produce poor estimates of expected returns for reasoning tasks and proposed Monte Carlo vine\-style rollout estimates as a more accurate alternative \(Kazemnejad et al\., [2024](https://arxiv.org/html/2606.00257#bib.bib20)\)\. GenAC extends this idea by replacing scalar value prediction with a generative critic that uses chain\-of\-thought reasoning for value estimation \(Shan et al\., [2026](https://arxiv.org/html/2606.00257#bib.bib36)\)\.

A growing body of work questions whether value heads are necessary at all in LLM post\-training\. ReMax replaces learned critics with a greedy baseline and shows that much of PPO’s practical benefit can be retained with a simpler REINFORCE\-style objective \(Li et al\., [2024b](https://arxiv.org/html/2606.00257#bib.bib9)\)\. Back to Basics systematically compares PPO with RLOO\-style estimators and finds that critic\-free methods can match or exceed value\-based baselines in RLHF \(Ahmadian et al\., [2024](https://arxiv.org/html/2606.00257#bib.bib8)\)\. REINFORCE\+\+ further strengthens this line by replacing the prompt\-level advantage normalization used by GRPO and RLOO with a global, batch\-level normalization, stabilizing critic\-free policy optimization without reintroducing a learned value function \(Hu et al\., [2025](https://arxiv.org/html/2606.00257#bib.bib10)\)\. In reasoning\-focused RLVR, GRPO normalizes rewards across a group of sampled responses rather than learning prefix values \(Shao et al\., [2024](https://arxiv.org/html/2606.00257#bib.bib11)\), and later work refines this family with theoretical analyses, off\-policy variants, and sequence\-level formulations such as GSPO \(Mroueh et al\., [2025](https://arxiv.org/html/2606.00257#bib.bib18); Zheng et al\., [2025](https://arxiv.org/html/2606.00257#bib.bib25)\)\. These methods substantially simplify the optimization stack, but they generally still broadcast a trajectory\-level signal across all tokens once a scalar baseline is fixed\.
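To make the contrast between prompt-level and batch-level baselines concrete, the sketch below computes a leave-one-out (RLOO-style) advantage within one prompt's group and a single global normalization over all rewards in a batch. Function names and the `eps` term are illustrative assumptions, and the batch-level variant shows only the normalization idea described above, not the full REINFORCE++ estimator.

```python
import math


def rloo_advantages(group):
    """Leave-one-out baseline: each reward minus the mean of the other K-1."""
    k = len(group)
    total = sum(group)
    return [r - (total - r) / (k - 1) for r in group]


def batch_normalized_advantages(groups, eps=1e-6):
    """Normalize with one mean and std computed over every reward in the batch."""
    flat = [r for g in groups for r in g]
    mean = sum(flat) / len(flat)
    std = math.sqrt(sum((r - mean) ** 2 for r in flat) / len(flat))
    return [[(r - mean) / (std + eps) for r in g] for g in groups]


# Two prompts, four completions each, binary rewards.
groups = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
print(rloo_advantages(groups[1]))              # all-correct group: zero signal
print(batch_normalized_advantages(groups)[1])  # same group: non-zero signal
```

In either case the resulting scalar is shared by every token of a completion, which is the broadcast behaviour that token-level weighting schemes try to refine.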

### A\.3 Reasoning\-Specific Analyses of RLVR

Another recent line asks why critic\-free RLVR works so well for reasoning in the first place\. A Minimalist Approach to LLM Reasoning shows that a simple rejection\-sampling baseline \(RAFT\) is surprisingly competitive with GRPO and PPO on reasoning tasks, and that GRPO’s advantage over vanilla REINFORCE comes primarily from discarding prompts whose sampled responses are all incorrect rather than from reward normalization\. The authors distill this insight into Reinforce\-Rej, a minimal REINFORCE variant that filters both entirely\-correct and entirely\-incorrect samples, suggesting that much of the field’s complexity is optional rather than essential \(Xiong et al\., [2025](https://arxiv.org/html/2606.00257#bib.bib26)\)\. Complementarily, Wen et al\. analyze RLVR theoretically and empirically, arguing that answer\-level verifiable rewards can nonetheless incentivize correct intermediate reasoning early in a trajectory \(Wen et al\., [2025](https://arxiv.org/html/2606.00257#bib.bib27)\)\. These results are important because they show that uniform outcome\-level objectives already contain nontrivial reasoning signal\.

At the same time, they do not eliminate the question of *which* tokens should absorb that signal\. Recent empirical analysis of RLVR training dynamics suggests that improvements are disproportionately driven by a minority of high\-entropy “forking” tokens, and that restricting updates to this subset can remain competitive or even outperform full\-token updates in some settings \(Wang et al\., [2025b](https://arxiv.org/html/2606.00257#bib.bib28)\)\. Building on this observation, EAPO adapts conditional mutual information to the autoregressive RLVR setting and proves that the credit a token can carry is upper\-bounded by its entropy, motivating an entropy\-aware modulation of per\-token learning signals \(He et al\., [2026](https://arxiv.org/html/2606.00257#bib.bib33)\)\. ERPO similarly identifies “critical decision pivots” at high\-entropy states and applies targeted exploration at those positions \(Yu et al\., [2026](https://arxiv.org/html/2606.00257#bib.bib35)\)\. *Sparse but Critical* argues, via a distributional\-shift analysis based on cross\-sampling between policy and reference, that a small fraction of tokens accounts for most of the useful update magnitude in RLVR, and proposes a divergence\-based selection rule for identifying them \(Meng et al\., [2026](https://arxiv.org/html/2606.00257#bib.bib42)\)\. These entropy\- and divergence\-based methods are the closest concurrent work to the intrinsic weighting schemes that form our baselines; the present paper does *not* claim priority over them\. Instead, our contribution is to show that their underlying signals are structurally degraded under LoRA, and to provide a credit\-assignment mechanism \(ARCA\) that does not suffer the same degradation\.
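The idea of restricting updates to high-entropy positions can be illustrated with a small sketch. It computes the Shannon entropy of each next-token distribution and keeps the top fraction of positions. The function names and the `fraction` parameter are hypothetical, and this is a generic selection rule rather than the procedure of any cited paper.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def top_entropy_mask(token_dists, fraction=0.2):
    """Return a 0/1 mask keeping the highest-entropy positions."""
    ents = [entropy(p) for p in token_dists]
    k = max(1, math.ceil(fraction * len(ents)))
    order = sorted(range(len(ents)), key=lambda i: ents[i], reverse=True)
    keep = set(order[:k])
    return [1.0 if i in keep else 0.0 for i in range(len(ents))]


# Five positions over a two-token vocabulary; only position 1 is a real "fork".
dists = [[1.0, 0.0], [0.5, 0.5], [0.9, 0.1], [0.99, 0.01], [0.7, 0.3]]
print(top_entropy_mask(dists, fraction=0.2))
```

Because the mask is built from the policy's output distribution, it is exactly the kind of signal the degeneration analysis predicts will change character when the policy is confined to a low-rank neighborhood of the reference model.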

### A\.4 Fine\-Grained Credit Assignment Beyond Uniform Token Updates

A large parallel literature attempts to make supervision denser than a single end\-of\-trajectory reward\. One family learns explicit token\- or process\-level reward estimators\. Preference\-grounded Token\-level Guidance derives token\-level signals from preference data for fine\-tuning \(Yang et al\., [2023](https://arxiv.org/html/2606.00257#bib.bib12)\), while Discriminative Policy Optimization learns token\-level Q\-style reward models from preferences without requiring fine\-grained annotation \(Chen et al\., [2025](https://arxiv.org/html/2606.00257#bib.bib29)\)\. PRIME pushes this direction further for reasoning tasks by constructing implicit process rewards and updating process reward models online from rollouts and outcome labels \(Cui et al\., [2025](https://arxiv.org/html/2606.00257#bib.bib21)\)\. Reinforced Token Optimization \(RTO\) frames RLHF as a token\-level MDP and uses DPO log\-likelihood ratios as a per\-token reward signal derived from the preference model, which is then optimized with PPO to produce fine\-grained token\-level credit \(Zhong et al\., [2025](https://arxiv.org/html/2606.00257#bib.bib34)\)\. These methods can produce rich token\-level feedback, but they rely on an auxiliary reward\-modeling component that our approach intentionally avoids\.

A second family learns a small token-level critic on top of the policy’s own hidden states (rather than an external reward model) and uses its predictions to form token-level advantages. This corresponds closely to an earlier version of our own framework, and we found through an extensive literature review that the token-level actor–critic design space was already well-covered: VinePPO (Kazemnejad et al., [2024](https://arxiv.org/html/2606.00257#bib.bib20)) and GenAC (Shan et al., [2026](https://arxiv.org/html/2606.00257#bib.bib36)) cover non-parametric and generative critics respectively, and OTB (Li et al., [2026](https://arxiv.org/html/2606.00257#bib.bib31)) provides a principled variance-minimizing position-dependent baseline. We therefore do *not* position critic-based variants as part of the proposed method, and we leave a full comparison against token-level critics to future empirical work.

A third family redistributes outcome rewards algorithmically rather than by training a dense reward model. RED derives token-level rewards by redistributing holistic reward-model scores over a trajectory (Li et al., [2024a](https://arxiv.org/html/2606.00257#bib.bib19)). SCAR uses Shapley-inspired attribution to estimate token or span contributions from sequence-level feedback (Cao et al., [2025](https://arxiv.org/html/2606.00257#bib.bib13); Shapley, [1953](https://arxiv.org/html/2606.00257#bib.bib14)). TEMPO uses groups of sampled solutions to build a prefix tree and compute nonparametric prefix values, yielding branch-sensitive corrections without a learned critic (Tran et al., [2025](https://arxiv.org/html/2606.00257#bib.bib15)). GRPO-$\lambda$ introduces eligibility traces and lambda-returns into critic-free RLVR to propagate outcome information backward along the sequence (Parthasarathi et al., [2025](https://arxiv.org/html/2606.00257#bib.bib22)). OTB derives an optimal position-dependent baseline that minimizes per-token gradient variance; the practical estimator sets the baseline at position $t$ to a cumulative-gradient-energy-weighted average of group returns-to-go, where the weights are a causal logit-gradient proxy computed directly from the forward-pass probabilities (Li et al., [2026](https://arxiv.org/html/2606.00257#bib.bib31)). PSPO applies potential-based reward shaping theory to construct dense token rewards from outcome signals without altering the optimal policy (Hu et al., [2026](https://arxiv.org/html/2606.00257#bib.bib32)). Compared with these approaches, our focus is orthogonal: we do not propose a new redistribution rule, but instead diagnose a failure mode that can affect intrinsic-signal-based redistribution methods when applied inside LoRA.

### A.5 Alternative Action Granularities and Token Selection

Several papers attack the same underlying problem by changing the optimization unit rather than by redesigning the reward. MA-RLHF introduces macro actions, grouping tokens into higher-level units to shorten the temporal distance between decisions and rewards (Chai et al., [2025](https://arxiv.org/html/2606.00257#bib.bib30)). GSPO moves importance weighting and clipping from the token level to the sequence level for greater stability in large-scale RL training (Zheng et al., [2025](https://arxiv.org/html/2606.00257#bib.bib25)). These methods reinforce the broader point that the granularity of optimization matters as much as the reward definition.

Outside on-policy RL proper, token-selective optimization has also been explored in adjacent paradigms. ConfPO emphasizes preference-critical tokens based on policy confidence in preference optimization (Yoon et al., [2025](https://arxiv.org/html/2606.00257#bib.bib16)), and token weighting for long-range language modeling shows that non-uniform token emphasis can improve optimization even in supervised settings (Helm et al., [2025](https://arxiv.org/html/2606.00257#bib.bib17)). Together with high-entropy token analyses in RLVR (Wang et al., [2025b](https://arxiv.org/html/2606.00257#bib.bib28)), these results strongly suggest that uniform token treatment is an arbitrary design choice rather than a necessity.

## Appendix B Additional Method Context

### B.1 Relationship to Existing Methods

Our formulation recovers standard critic-free RL as a special case: uniform weighting plus a prompt-level batch baseline yields the usual RLOO/GRPO-style estimator. The intrinsic weighting schemes (surprisal, entropy reduction, policy divergence) are the natural baselines within our framework and map cleanly to published methods (e.g., divergence weighting corresponds to the divergence-advantage analyses of *Sparse but Critical* (Meng et al., [2026](https://arxiv.org/html/2606.00257#bib.bib42)), entropy-based weighting to (He et al., [2026](https://arxiv.org/html/2606.00257#bib.bib33); Yu et al., [2026](https://arxiv.org/html/2606.00257#bib.bib35))). ARCA is the only scheme we study that uses an explicitly *adaptation-aware* signal. Unlike critic-based PPO, ARCA does not learn a value head. Unlike RED, PRIME, or token-level reward-model methods, ARCA does not construct dense rewards from an auxiliary model. Unlike TEMPO or related tree-based methods, ARCA does not build prefix graphs or estimate branch values. Unlike hard token-selection methods based on top-entropy masking, ARCA retains a dense token update but modulates it continuously using an intrinsic salience score derived from the adapter itself.

The resulting estimator is intentionally minimal\. It changes only the within\-trajectory allocation of a scalar on\-policy advantage, using quantities already available from the forward pass with the adapter disabled\. This makes it easy to layer on top of existing critic\-free RLVR pipelines while preserving their computational simplicity\.

## Appendix C Theoretical Interpretation

We provide an interpretation of intrinsic token weighting as a variance\-control and credit\-redistribution mechanism for policy gradient estimation in language models\. Our goal is not to derive new convergence guarantees, but to clarify how token weighting relates to advantage estimation and why it can be effective without learning value functions\.

### C.1 Token Weighting as Credit Redistribution

Consider the standard REINFORCE estimator with a batch\-level baseline:

$$g = (R - b)\sum_{t=1}^{T} \nabla_{\theta}\log\pi_{\theta}(y_{t}\mid y_{<t},x). \tag{16}$$

This estimator assigns equal credit to all tokens in a trajectory\. Introducing token weights yields:

$$g_{w} = (R - b)\sum_{t=1}^{T} w_{t}(y)\,\nabla_{\theta}\log\pi_{\theta}(y_{t}\mid y_{<t},x), \tag{17}$$

where $\sum_{t} w_{t} = 1$.

Importantly, this transformation does not change the reward signal itself; it redistributes how the trajectory-level signal is attributed to individual decisions. When $w_{t}$ depends only on quantities available at sampling time (e.g., log-probabilities or entropy), the estimator remains on-policy and does not require learning additional functions.
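As a rough illustration of Eq. (17), the sketch below turns raw salience scores into weights that sum to one and then forms the per-token multipliers $(R-b)\,w_t$ that would scale each $\nabla_\theta \log \pi_\theta$ term. The function names, the `eps` floor, and the toy numbers are illustrative assumptions, not the paper's implementation.

```python
def normalize_weights(scores, eps=1e-8):
    """Map non-negative salience scores to weights that sum to 1.

    The additive eps floor is an assumed normalization form.
    """
    floored = [s + eps for s in scores]
    total = sum(floored)
    return [f / total for f in floored]


def token_coefficients(reward, baseline, weights):
    """Per-token multipliers (R - b) * w_t for the score-function terms."""
    advantage = reward - baseline
    return [advantage * w for w in weights]


# Hypothetical 5-token trajectory with reward 1.0 and batch baseline 0.4.
scores = [0.1, 2.0, 0.1, 0.7, 0.1]
weights = normalize_weights(scores)
coeffs = token_coefficients(1.0, 0.4, weights)
print(weights)
print(coeffs)
```

With equal scores the weights reduce to $1/T$, so the estimator matches Eq. (16) up to a constant factor of $1/T$. In an autodiff framework the weights would typically be treated as constants (no gradient flows through them); that is an assumption of this sketch rather than something stated in this appendix.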

### C.2 Relationship to Advantage Estimation

Advantage\-based methods can be interpreted as implicitly defining token\-level weights via a learned value function:

$$A_{t} = R - V(y_{<t},x). \tag{18}$$

In this view, the contribution of each token is scaled according to how much the observed outcome deviates from an estimated expectation conditioned on the prefix.

However, this interpretation relies on the existence of a meaningful value function over prefixes. In language generation, prefixes are non-Markov and do not form a stable state space, making $V(y_{<t},x)$ difficult to estimate and prone to bias. Token weighting offers an alternative: rather than estimating expected returns from prefixes, it emphasizes tokens based on intrinsic measures of decision salience, such as uncertainty or information gain.

### C.3 Connection to Control Variates

From a statistical perspective, subtracting a baseline $b$ is a form of control variate that reduces variance without affecting unbiasedness. Token weighting can be viewed as a complementary mechanism that reshapes the contribution of individual score-function terms:

$$\nabla_{\theta}\log\pi_{\theta}(y) = \sum_{t=1}^{T}\nabla_{\theta}\log\pi_{\theta}(y_{t}\mid y_{<t},x). \tag{19}$$

Uniform weighting treats all score\-function terms equally, regardless of their variance or relevance\. Intrinsic weighting schemes effectively reweight these terms, down\-weighting low\-variance or low\-impact contributions \(e\.g\., high\-probability continuation tokens\) and emphasizing high\-variance or high\-impact decisions \(e\.g\., low\-probability or uncertainty\-reducing tokens\)\. While this reweighting introduces bias relative to the uniform estimator, it can substantially reduce variance, leading to improved optimization dynamics in practice\.
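The claim that a constant baseline leaves the expected gradient unchanged can be checked exactly on a toy one-step softmax policy. The sketch below enumerates all actions instead of sampling; the logits and rewards are made-up values, and the single-step setting is a simplifying assumption.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def grad_log_prob(probs, action):
    """Gradient of log softmax(logits)[action] w.r.t. the logits."""
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]


def expected_gradient(logits, rewards, baseline):
    """Exact E_a[(R(a) - b) * grad log pi(a)] by enumerating actions."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, pa in enumerate(probs):
        g = grad_log_prob(probs, a)
        for k in range(len(logits)):
            grad[k] += pa * (rewards[a] - baseline) * g[k]
    return grad


logits = [0.2, -1.0, 0.5]    # hypothetical policy logits
rewards = [1.0, 0.0, 0.3]    # hypothetical per-action rewards
g_no_baseline = expected_gradient(logits, rewards, baseline=0.0)
g_with_baseline = expected_gradient(logits, rewards, baseline=0.7)
print(g_no_baseline)
print(g_with_baseline)
```

The two printed vectors agree up to floating-point error because $\mathbb{E}_a[\nabla \log \pi(a)] = 0$. Token weights that vary across positions carry no such guarantee, which is the bias discussed above.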

### C.4 Interpretation of Specific Weighting Schemes

#### Surprisal Weighting\.

Surprisal\-based weights emphasize tokens with low conditional probability under the current policy\. These tokens correspond to decisions where small parameter changes can induce large changes in likelihood, and therefore often dominate gradient variance\. Weighting by surprisal concentrates learning signal on such high\-sensitivity decisions\.

#### Entropy\-Reduction Weighting\.

Entropy\-reduction weighting emphasizes tokens that sharply reduce predictive entropy\. These tokens correspond to commitment points in generation, where the model transitions from a diffuse distribution over continuations to a more concentrated one\. This mirrors the role of branching points in tree\-based reasoning methods, but is computed locally from the model’s predictive distribution without explicit tree construction\.
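The two raw scores described above can be sketched on a toy three-token vocabulary. Surprisal follows the definition $-\log \pi_\theta(y_t \mid y_{<t}, x)$. The entropy-reduction score is written here as the positive part of $H_t - H_{t+1}$, which is an assumed form chosen for illustration; the exact definition may differ, and all distributions below are invented.

```python
import math


def surprisal(prob_of_sampled):
    """-log pi(y_t | y_<t, x) for each sampled token."""
    return [-math.log(p) for p in prob_of_sampled]


def entropy(dist):
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def entropy_reduction(dists):
    """Assumed score: max(0, H_t - H_{t+1}); the final token gets 0."""
    h = [entropy(d) for d in dists]
    drops = [max(0.0, h[t] - h[t + 1]) for t in range(len(h) - 1)]
    return drops + [0.0]


# Hypothetical next-token distributions at three positions.
dists = [[0.34, 0.33, 0.33], [0.90, 0.05, 0.05], [0.80, 0.10, 0.10]]
sampled = [1, 0, 0]
probs = [dists[t][y] for t, y in enumerate(sampled)]

surp = surprisal(probs)
ent_red = entropy_reduction(dists)
print(surp)
print(ent_red)
```

In this toy case both scores peak at the first position, where a near-uniform distribution is followed by a sharply peaked one, matching the "commitment point" reading given above.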

### C.5 Bias–Variance Tradeoff

Both advantage estimation and intrinsic token weighting introduce bias in exchange for variance reduction\. Advantage estimation does so by relying on a learned value function, whose bias can be substantial when the underlying state abstraction is weak\. Intrinsic token weighting introduces bias by reshaping the contribution of token\-level gradients based on heuristic salience measures\. The empirical question is therefore not whether bias is introduced, but whether the bias–variance tradeoff is favorable\.

This motivates the empirical comparison implemented in the repository: if intrinsic token weighting improves accuracy, pass@$k$, or optimization behavior under matched settings, then the bias it introduces may be more benign than the bias induced by ill-defined value targets over text prefixes. The point is not that weighting is unbiased, but that its inductive bias may be better aligned with token-level credit assignment in language generation.

### C.6 Signal Degeneration Under Low-Rank Adaptation

The preceding analysis assumes that the intrinsic weights $\{w_{t}\}_{t=1}^{T}$ are non-degenerate: if every scheme becomes uniform or spuriously sparse for reasons unrelated to the learned adapter, then token weighting cannot provide meaningful credit redistribution over a uniform baseline. We now formalize when this degeneracy occurs under LoRA.

#### Setup\.

Let $\pi_{\mathrm{ref}}$ be a frozen base model with last-layer hidden states $h_{t}^{\mathrm{base}}\in\mathbb{R}^{d}$, equipped with a LoRA adapter whose net contribution at position $t$ is $\Delta h_{t}\in\mathbb{R}^{d}$, giving adapted hidden states $h_{t}^{\mathrm{adapted}}=h_{t}^{\mathrm{base}}+\Delta h_{t}$ and logits $z_{t}^{\theta}=W_{\mathrm{lm}}h_{t}^{\mathrm{adapted}}$. Let $\beta=\max_{t}\|\Delta h_{t}\|_{2}$ denote the maximum per-position adapter contribution and $L=\|W_{\mathrm{lm}}\|_{\mathrm{op}}$ the operator norm of the output projection, which serves as the Lipschitz constant of the map from hidden states to logits. Write $G(\cdot)$ for the Gini coefficient of a non-negative vector and $\mathrm{EffN}(w)/T=1/(T\sum_{t}w_{t}^{2})$ for the normalized effective-token count. When a normalization uses an $\varepsilon$ floor, the statements below are interpreted in the regime where the raw salience scores dominate $\varepsilon$; if a fixed $\varepsilon$ dominates all scores, every scheme reverts to uniform by construction. These are local deterministic statements about the normalized score vectors for a fixed trajectory, not asymptotic convergence guarantees for RL training.
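The two concentration diagnostics named in the setup can be computed directly from a weight vector. The sketch below uses the standard mean-absolute-difference form of the Gini coefficient and the formula $\mathrm{EffN}(w)/T = 1/(T\sum_t w_t^2)$ from the text; the function names are illustrative.

```python
def gini(w):
    """Gini coefficient of a non-negative vector (0 = perfectly uniform)."""
    n = len(w)
    total = sum(w)
    if n == 0 or total == 0:
        return 0.0
    abs_diffs = sum(abs(a - b) for a in w for b in w)
    return abs_diffs / (2.0 * n * total)


def effective_token_ratio(w):
    """EffN(w)/T = 1 / (T * sum_t w_t^2), after renormalizing w to sum to 1."""
    total = sum(w)
    p = [x / total for x in w]
    return 1.0 / (len(p) * sum(x * x for x in p))


uniform = [0.25, 0.25, 0.25, 0.25]
one_hot = [1.0, 0.0, 0.0, 0.0]
print(gini(uniform), effective_token_ratio(uniform))
print(gini(one_hot), effective_token_ratio(one_hot))
```

Uniform weights sit at one extreme (Gini 0, ratio 1) and a one-hot vector at the other (Gini $(T-1)/T$, ratio $1/T$); the two degenerate regimes discussed in this section correspond to weights drifting toward either end.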

###### Proposition 1 (Degeneration of surprisal and entropy weighting).

Let $\alpha_{t}^{\mathrm{surp}}=-\log\pi_{\theta}(y_{t}\mid y_{<t},x)$ and $\alpha_{t}^{\mathrm{surp,ref}}=-\log\pi_{\mathrm{ref}}(y_{t}\mid y_{<t},x)$, with corresponding normalized weights $w^{\mathrm{surp}}$ and $w^{\mathrm{surp,ref}}$. Then for every trajectory,

$$\lVert w^{\mathrm{surp}}-w^{\mathrm{surp,ref}}\rVert_{1}\leq\frac{4L\beta}{\sum_{t}\alpha_{t}^{\mathrm{surp,ref}}}.$$

In particular, the adapted surprisal weights remain close to the reference model’s surprisal profile rather than becoming an adapter-specific credit signal. If that reference profile is close to uniform, the weights approach $1/T$; if it is highly uneven, the weights can instead remain spuriously concentrated for reasons inherited from the base model. The analogous statement holds for the entropy-reduction score.

*Proof.* The softmax Jacobian is $2$-Lipschitz in the log-space norm, so the per-token log-probabilities differ by at most $2L\beta$ between the adapted and base model. The projection onto the simplex via within-trajectory normalization is $1$-Lipschitz in $\ell_{1}$ up to a factor of $2/\sum_{t}\alpha_{t}^{\mathrm{ref}}$; combining these gives the stated bound. $\square$

###### Proposition 2 (Degeneration of divergence weighting).

Let $\alpha_{t}^{\mathrm{div}}=\lvert\log\pi_{\theta}(y_{t}\mid y_{<t},x)-\log\pi_{\mathrm{ref}}(y_{t}\mid y_{<t},x)\rvert$ and $w^{\mathrm{div}}=(\alpha^{\mathrm{div}}+\varepsilon)/\sum_{t}(\alpha_{t}^{\mathrm{div}}+\varepsilon)$. Then

$$0\leq\alpha_{t}^{\mathrm{div}}\leq 2L\beta\qquad\forall t,$$

so if $\varepsilon$ is fixed and $\beta\to 0$, then $w^{\mathrm{div}}\to 1/T$. If $\varepsilon=0$ or is negligible relative to the raw scores, the same normalization is ill-conditioned as $\beta\to 0$: the limiting weights are determined by tiny relative differences among vanishing log-probability shifts rather than by a robust adapter-credit signal.

*Proof.* The first inequality follows from the same log-probability Lipschitz bound as Proposition [1](https://arxiv.org/html/2606.00257#Thmproposition1). If $\varepsilon>0$ is fixed, then each numerator in $w^{\mathrm{div}}$ is $\varepsilon(1+O(\beta/\varepsilon))$ and the denominator is $T\varepsilon(1+O(\beta/\varepsilon))$, so $w_{t}^{\mathrm{div}}\to 1/T$ as $\beta/\varepsilon\to 0$. If the floor is removed or negligible, the normalization divides by $\sum_{t}\alpha_{t}^{\mathrm{div}}\to 0$; different sequences of vanishing score vectors can converge to different normalized limits, including highly concentrated ones. $\square$

The key observation is that the intrinsic scores are all determined by the per-token *log-probability differential*, which is precisely what LoRA’s rank-$r$ constraint drives small and approximately uniform across positions. The degeneration is not a property of a particular weighting scheme; it is a property of measuring per-token variation through the output distribution when the policy has been constrained to a small neighborhood of the reference.
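Both regimes of Proposition 2 can be seen numerically. The sketch below scales a fixed, made-up pattern of log-probability shifts by $\beta$ and normalizes with the floor $(\alpha + \varepsilon)/\sum_t(\alpha_t + \varepsilon)$; the pattern and the value of `eps` are illustrative assumptions.

```python
def divergence_weights(scores, eps):
    """w_t = (alpha_t + eps) / sum_t (alpha_t + eps)."""
    floored = [s + eps for s in scores]
    total = sum(floored)
    return [f / total for f in floored]


def max_dev_from_uniform(w):
    """Largest absolute gap between any weight and 1/T."""
    return max(abs(x - 1.0 / len(w)) for x in w)


# Hypothetical relative pattern of |log pi_theta - log pi_ref| over 5 tokens.
pattern = [0.0, 1.0, 0.2, 0.0, 0.5]
betas = [1.0, 1e-2, 1e-4, 1e-6]

# Fixed floor: weights collapse toward 1/T as beta shrinks.
devs_fixed_eps = [
    max_dev_from_uniform(divergence_weights([b * p for p in pattern], eps=1e-3))
    for b in betas
]

# No floor: weights depend only on the relative pattern, however tiny beta is.
w_large = divergence_weights([1.0 * p for p in pattern], eps=0.0)
w_tiny = divergence_weights([1e-6 * p for p in pattern], eps=0.0)

print(devs_fixed_eps)
print(w_large, w_tiny)
```

With the floor, the deviation from uniform shrinks monotonically with $\beta$. Without it, the same concentrated weights come out whether the shifts are of order 1 or of order $10^{-6}$, so the normalized signal is driven entirely by relative differences among vanishing quantities.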

### C.7 ARCA and Position-Discriminative Signal

ARCA side\-steps this degeneration by measuring a quantity that is position\-varying even when the adapter’s logit\-space impact is small\.

###### Proposition 3 (ARCA remains non-degenerate).

Suppose the adapter-induced residual admits the decomposition $\Delta h_{t}=\sum_{\ell}B_{\ell}A_{\ell}x_{\ell,t}^{\mathrm{pre}}+e_{t}$, where $x_{\ell,t}^{\mathrm{pre}}$ is the pre-adapter activation at layer $\ell$ and position $t$, and $e_{t}$ is a higher-order cross-layer term. Let $v_{\ell,t}=B_{\ell}A_{\ell}x_{\ell,t}^{\mathrm{pre}}$ and define $V_{\ell}=\mathrm{Var}_{t}(\|v_{\ell,t}\|)$ and $\bar{V}_{\ell}=\mathbb{E}_{t}[\|v_{\ell,t}\|]$. Then

$$\mathrm{Var}_{t}(\alpha_{t}^{\mathrm{ARCA}})\geq\max_{\ell}V_{\ell}-O\!\left(\sum_{\ell}\bar{V}_{\ell}\,\|e_{t}\|\right).$$

Assume the residual scores dominate the normalization floor $\varepsilon$. In particular, as long as the pre-adapter activations $x_{\ell,t}^{\mathrm{pre}}$ have nontrivial position-to-position variance at any layer, $\alpha_{t}^{\mathrm{ARCA}}$ has bounded-below variance across positions, and its normalized weights $w_{t}^{\mathrm{ARCA}}$ do not converge to the uniform distribution merely because the output-distribution shift is small. With fixed $\varepsilon$, however, all scores become uniform in the limit where the adapter residual itself vanishes below the floor.

*Proof.* Ignoring the higher-order residual $e_{t}$, the ARCA score contains the norms $\|v_{\ell,t}\|$ of the layerwise adapter contributions. A dominant layer with nonzero position-to-position variance therefore gives nonzero variance in the raw ARCA scores. The perturbation $e_{t}$ changes these norms by at most its size through the reverse triangle inequality, giving the stated lower bound up to the displayed higher-order term. Normalization by a common positive sum preserves non-uniformity while the raw scores remain above the floor; if a fixed $\varepsilon$ dominates all residual scores, the normalized weights become uniform by equation ([8](https://arxiv.org/html/2606.00257#S3.E8)). $\square$

The intuitive content of Proposition [3](https://arxiv.org/html/2606.00257#Thmproposition3) is that ARCA inherits its position-discrimination from the *base model’s own activation pattern* (which is already non-uniform because attention and layer norm routing produce content-versus-filler distinctions) rather than from a quantity that LoRA is specifically constrained to make small. Scaling the adapter weights uniformly scales $\alpha^{\mathrm{ARCA}}$ uniformly, so the normalized weights $w^{\mathrm{ARCA}}$ are scale-invariant in the adapter magnitude whenever the scores dominate the $\varepsilon$ floor.
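A minimal sketch of the ARCA score, $\alpha_t = \|h_t^{\mathrm{adapted}} - h_t^{\mathrm{base}}\|_2$, and of the scale-invariance remark is given below. The hidden states are tiny invented vectors, the normalization is assumed to take the same floored form as in Proposition 2, and scaling the residual directly stands in for scaling the adapter weights (exact only to first order).

```python
import math


def arca_scores(h_adapted, h_base):
    """alpha_t = L2 norm of (h_adapted_t - h_base_t) at each position."""
    return [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(ha, hb)))
        for ha, hb in zip(h_adapted, h_base)
    ]


def normalize(scores, eps=0.0):
    """Assumed normalization: (alpha_t + eps) / sum_t (alpha_t + eps)."""
    floored = [s + eps for s in scores]
    total = sum(floored)
    if total == 0.0:
        return [1.0 / len(scores)] * len(scores)
    return [f / total for f in floored]


def add_residual(h_base, delta, scale):
    """Hypothetical adapted states: base plus a scaled adapter residual."""
    return [[b + scale * d for b, d in zip(hb, dh)] for hb, dh in zip(h_base, delta)]


# Toy last-layer states for 3 positions in 2 dimensions (invented values).
h_base = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
delta = [[0.3, 0.4], [0.0, 0.0], [0.06, 0.08]]

w_full = normalize(arca_scores(add_residual(h_base, delta, 1.0), h_base))
w_small = normalize(arca_scores(add_residual(h_base, delta, 0.1), h_base))
w_floored = normalize(arca_scores(add_residual(h_base, delta, 1e-9), h_base), eps=1e-3)
print(w_full, w_small, w_floored)
```

Shrinking the residual by a factor of ten leaves the unfloored weights unchanged, while a residual far below a fixed floor gives near-uniform weights, mirroring the final caveat of Proposition 3. In a real pipeline the two sets of hidden states would come from forward passes with the adapter enabled and disabled; that wiring is not shown here.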

### C.8 Bias–Variance Tradeoff

Both advantage estimation and intrinsic token weighting introduce bias in exchange for variance reduction. Advantage estimation does so by relying on a learned value function, whose bias can be substantial when the underlying state abstraction is weak. Intrinsic token weighting introduces bias by reshaping the contribution of token-level gradients based on heuristic salience measures. The empirical question is therefore not whether bias is introduced, but whether the bias–variance tradeoff is favorable and whether, per Propositions [1](https://arxiv.org/html/2606.00257#Thmproposition1) and [2](https://arxiv.org/html/2606.00257#Thmproposition2), the bias-inducing signal survives the LoRA bottleneck at all.

### C.9 Summary

This perspective reframes advantage estimation as one particular approach to credit redistribution in policy gradient methods. Intrinsic token weighting offers an alternative that leverages the structure of the language model’s predictive distribution, without assuming the existence of meaningful state values. Under LoRA, however, the output-distribution-based signals that most published weighting methods rely on can degenerate into uniform or spuriously sparse token weights; the adapter-residual signal that ARCA uses is, by Proposition [3](https://arxiv.org/html/2606.00257#Thmproposition3), non-degenerate whenever the adapter residual remains measurable above the normalization floor.
