LEAF: Growing Trees Without Branching for Speech-Aware Large Language Model Post-Training
Summary
This paper proposes LEAF, a retrospective tree-based reinforcement learning method for speech-aware large language model post-training that improves credit assignment without online branching. LEAF outperforms GRPO on speech question answering and speech translation benchmarks.
Cached at: 06/09/26, 08:50 AM
# LEAF: Growing Trees Without Branching for Speech-Aware Large Language Model Post-Training
Source: [https://arxiv.org/html/2606.07610](https://arxiv.org/html/2606.07610)
###### Abstract
State\-of\-the\-art GRPO\-style methods for speech\-aware large language model post\-training suffer from coarse credit assignment, broadcasting the same terminal\-reward advantage to every token in a response\. This ignores useful structure within rollout batches, where speech\-conditioned completions often share prefixes before diverging at important decisions\. We propose Low\-rank Exploration with Adaptive Forking \(LEAF\), a retrospective tree\-based RL method that recovers this structure without online branching or additional decoding\. LEAF samples complete responses, selects high\-surprisal boundaries, groups responses by shared prefixes, and assigns span\-level advantages using descendant rewards\. We theoretically justify LEAF’s span\-level credit assignment and boundary\-selection design\. Empirically, LEAF improves over GRPO across speech question answering and speech translation benchmarks under the same rollout and low\-rank adaptation budget\. Notably, smaller LEAF\-trained models outperform current state\-of\-the\-art, full\-parameter baselines\.
Argyrios Gerogiannis \(correspondence: ag91@illinois\.edu\), Yekaterina Yegorova, Mark Hasegawa\-Johnson, Venugopal V\. Veeravalli
University of Illinois, Urbana\-Champaign
## 1 Introduction
Speech\-aware large language models \(SALLMs\) Arora et al\. \([2025](https://arxiv.org/html/2606.07610#bib.bib2)\) enable end\-to\-end spoken interaction across multiple tasks such as spoken question answering \(SQA\), automatic speech translation \(AST\), and speech recognition\. As these models improve, post\-training becomes essential for aligning generation with task\-level objectives not captured by next\-token prediction alone\. Reinforcement learning \(RL\) is a natural fit, and recent GRPO\-style methods Shao et al\. \([2024](https://arxiv.org/html/2606.07610#bib.bib34)\); Yu et al\. \([2025](https://arxiv.org/html/2606.07610#bib.bib12)\) have shown that reward\-driven adaptation can improve open\-form speech understanding while avoiding additional training costs by normalizing terminal rewards across multiple completions\.
Despite its simplicity, GRPO is limited by coarse credit assignment: it computes a terminal reward per response and broadcasts an identical scalar advantage to every token\. While efficient, this discards *where* a completion becomes promising or harmful\. The issue is especially pronounced in open\-form speech post\-training, where candidate answers often share early segments and diverge only around uncertain semantic decisions, so the bottleneck is not reward design alone but *rollout organization* and *credit assignment*\.
Tree\-structured RL offers a finer alternative: by branching from intermediate states and using descendant outcomes for process\-level supervision, it localizes credit to the decisions that matter\. Directly importing online tree expansion into SALLM post\-training is costly, however, since speech inputs are long and multimodal, decoding is batched, and rewards are task\-specific\. This motivates our central question: *can we obtain the benefits of structured exploration and process\-aware credit assignment while retaining the simplicity of batched GRPO\-style training for SALLMs?*
We answer with Low\-rank Exploration with Adaptive Forking \(LEAF\), a retrospective RL method for SALLM post\-training\. LEAF samples the same $K$ complete responses as GRPO, then reorganizes them into a sparse retrospective prefix tree\. It selects a small number of high\-surprisal token boundaries, groups completions by exact shared prefixes at those boundaries, estimates prefix values from descendant terminal rewards, and assigns span\-level advantages between retained nodes\. LEAF therefore keeps GRPO’s batched generation pattern while replacing sequence\-level credit with structured span\-level supervision derived from the rollout batch itself\. Finally, LEAF is designed for parameter\-efficient speech post\-training\. We instantiate it with LoRA Hu et al\. \([2022](https://arxiv.org/html/2606.07610#bib.bib11)\), motivated by recent evidence that low\-rank adaptation can approach full\-parameter RL post\-training performance \(Schulman and Lab, [2025](https://arxiv.org/html/2606.07610#bib.bib16)\)\. This lets us test whether better credit assignment can translate into stronger SALLM adaptation under a lightweight trainable\-parameter budget\. To our knowledge, LEAF is the first tree\-based RL method for speech post\-training, and the first within retrospective credit assignment to use an uncertainty\-guided mechanism for constructing prefix\-level supervision\.
Contributions\. We introduce LEAF, a retrospective tree\-based RL method for SALLM post\-training that extracts process\-aware span advantages from the same i\.i\.d\. rollout groups used by GRPO\. We provide theoretical support for LEAF’s design, showing that span\-level prefix advantages are valid credit signals, that the fork budget controls a support–resolution tradeoff, and that pure prefix matching is biased toward shallow high\-collision prefixes\. Empirically, we show that speech\-understanding rollouts are highly forkable across tasks and backbones, and that LEAF consistently improves over GRPO under the same rollout and LoRA adaptation budget, with gains in automatic metrics, judge scores, and tail\-risk behavior\.
## 2 Related Work
GRPO for SALLMs\. GRPO \(Shao et al\., [2024](https://arxiv.org/html/2606.07610#bib.bib34)\) is a critic\-free RL objective that normalizes terminal rewards within a group of sampled completions\. Recent work applies GRPO to speech\-aware language models for open\-form speech understanding, including SQA and AST \(Elmakies et al\., [2026](https://arxiv.org/html/2606.07610#bib.bib1)\), as well as to automatic speech recognition \(Shivakumar et al\., [2025](https://arxiv.org/html/2606.07610#bib.bib14)\) and text\-to\-speech \(Liu et al\., [2025](https://arxiv.org/html/2606.07610#bib.bib15)\)\. These works show that reward\-driven post\-training is effective for speech, but retain GRPO’s flat credit assignment: one terminal\-reward advantage is broadcast to all tokens in a response\. LEAF keeps the same group\-based rollout interface, but replaces sequence\-level credit with span\-level credit derived from shared prefixes\.
Fine\-grained and tree\-structured credit\. Several recent methods seek denser credit than trajectory\-level rewards in LLM RL\. VinePPO estimates intermediate values by sampling continuations from partial states \(Kazemnejad et al\., [2025](https://arxiv.org/html/2606.07610#bib.bib3)\), SPO assigns advantages at the segment level \(Guo et al\., [2026](https://arxiv.org/html/2606.07610#bib.bib36)\), and high\-entropy analyses suggest that a few uncertain tokens often drive downstream decisions \(Wang et al\., [2026](https://arxiv.org/html/2606.07610#bib.bib37)\)\. Tree\-based methods obtain process\-level supervision by organizing rollouts into branching structures \(Li et al\., [2025](https://arxiv.org/html/2606.07610#bib.bib6); Yang et al\., [2025](https://arxiv.org/html/2606.07610#bib.bib5); Ji et al\., [2026](https://arxiv.org/html/2606.07610#bib.bib13); Hou et al\., [2025](https://arxiv.org/html/2606.07610#bib.bib4)\)\. LEAF is closest to TreeRL and TEMPO/P2T \(Hou et al\., [2025](https://arxiv.org/html/2606.07610#bib.bib4); Tran et al\., [2025](https://arxiv.org/html/2606.07610#bib.bib18)\), which estimate prefix\-level credit from a fixed group of sampled completions, but it differs in setting, construction, and credit assignment: it performs no additional decoding, targets speech rather than text LLMs, and forks only at a sparse set of high\-surprisal locations rather than building the full prefix tree by exact token matching\. We formalize in Section [4](https://arxiv.org/html/2606.07610#S4) why such collision\-only matching is problematic, since it biases credit toward shallow, high\-support prefixes that need not carry reward\-relevant information\.
## 3 Method
LEAF departs from the dominant execution model in tree\-structured RL, which expands a search tree iteratively at inference time\. Instead, it exploits structure already present in the batch of i\.i\.d\. traces collected during training: candidate responses that share prefixes are organized retrospectively into a prefix tree, with no additional rollouts\. This tree replaces GRPO’s advantage computation\. Rather than assigning a single scalar advantage to every token in a response, LEAF derives piecewise\-constant advantages over token spans, localizing credit to where completions diverge \(Figure [1](https://arxiv.org/html/2606.07610#S3.F1)\)\.
### 3\.1 Preliminaries: GRPO
Let $x$ be a speech\-conditioned input, $\pi_{\mathrm{old}}$ the rollout policy, $\pi_{\theta}$ the optimized policy, and $\pi_{\mathrm{ref}}$ a fixed reference policy\. GRPO samples $K$ responses $y^{(1)},\ldots,y^{(K)}\sim\pi_{\mathrm{old}}(\cdot\mid x)$ and assigns terminal rewards $r_i=r(x,y^{(i)})$\. It forms the group\-relative advantage $A_i^{\mathrm{GRPO}}=(r_i-\bar{r})/\sigma_r$, where $\bar{r}$ and $\sigma_r$ are the sample mean and standard deviation, and broadcasts this scalar to every generated token in response $i$\. For token $a_t^{(i)}$ generated from state $s_t^{(i)}=(x,y_{<t}^{(i)})$, define $\rho_{i,t}(\theta)=\pi_{\theta}(a_t^{(i)}\mid s_t^{(i)})/\pi_{\mathrm{old}}(a_t^{(i)}\mid s_t^{(i)})$\. Given token\-level advantages $A_{i,t}$, let $\ell_{i,t}(\theta)=\min\{\rho_{i,t}A_{i,t},\ \operatorname{clip}_{\epsilon}(\rho_{i,t})A_{i,t}\}$\. The clipped KL\-regularized loss is
$$\mathcal{L}(\theta)=-\mathbb{E}_{i,t}\left[\ell_{i,t}(\theta)\right]+\beta\,\mathbb{E}_{i,t}\left[\widehat{\mathrm{KL}}_{i,t}\right],\tag{1}$$
where $\widehat{\mathrm{KL}}_{i,t}$ estimates the token\-level divergence from $\pi_{\mathrm{ref}}$ and $\operatorname{clip}_{\epsilon}(\rho)$ truncates $\rho$ to the range $[1-\epsilon,1+\epsilon]$\. GRPO is recovered by setting $A_{i,t}=A_i^{\mathrm{GRPO}}$ for all $t$\.
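As a rough illustration of the quantities above, the following Python sketch computes the group\-relative advantage and the per\-token clipped term $\ell_{i,t}$\. The function names, the small `eps` stabilizer, the default `clip_eps=0.2`, and the use of the population standard deviation are assumptions made for illustration; the text does not specify them\.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage (r_i - mean) / std, one scalar per response.

    Assumptions: population std (pstdev) and a small eps that avoids a
    division by zero when every reward in the group is identical.
    """
    r_bar = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - r_bar) / (sigma + eps) for r in rewards]


def clipped_term(ratio, advantage, clip_eps=0.2):
    """Per-token surrogate min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# GRPO broadcasts the same scalar to every token of a response.
rewards = [1.0, 0.0, 1.0, 0.0]   # toy terminal rewards for K = 4 responses
lengths = [3, 2, 4, 1]           # toy response lengths H_i
advantages = grpo_advantages(rewards)
token_advantages = [[a] * h for a, h in zip(advantages, lengths)]
```

The loss in Eq\. \(1\) would then average `-clipped_term(...)` over all tokens and add the KL penalty, which is omitted here\.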
### 3\.2 Retrospective Tree Credit Assignment
\[Figure 1 diagram\. Panel "Scan surprisals and select boundaries": $K=6$ completed responses are sampled and their token surprisals are scanned greedily in descending order with budget $B=2$\. The sorted candidates $(i,t)$ are $(2,2),\ (1,6),\ (5,7),\ (6,8)$: $p=2$ is accepted, $p=6$ is accepted, $p=7$ is rejected because $|7-6|<\Delta$, and $p=8$ is skipped because $|\mathcal{P}|=B=2$, giving $\mathcal{P}=\{2,6\}$\. Panel "Prefix tree, retained nodes, and span credit": at $p=2$ all six responses share one prefix \(node $v_2$, $n=6$\); at $p=6$ responses 1 to 3 form node $v_b$ \($n=3$\), responses 4 and 5 form node $v_o$ \($n=2$\), and response 6 is a singleton prefix with no retained node\. Panel "Span\-credit examples using retained nodes": the path $v_0\to v_2\to v_b$ for $y^{(1)}$ and the path $v_0\to v_2$ for $y^{(6)}$, each followed by a tail span\.\]
Figure 1: Overview of LEAF\. From the same $K$ completed responses, LEAF records token surprisals and greedily selects a small set of globally applied boundaries $\mathcal{P}$ using a separation rule\. At each selected boundary $p$, responses are grouped by exact token\-prefix equality, forming a tree over the selected boundaries\. Non\-singleton prefix groups become retained nodes whose values are averages of descendant terminal rewards, while singleton groups are not retained as advantage nodes\. The retained nodes on each response path partition the response into spans, and LEAF assigns one advantage to each span\.
Conceptually, LEAF asks whether shared partial prefixes in the rollout batch predict better or worse final outcomes\. It selects at most $B$ token boundaries, groups responses by exact prefix equality at each boundary, retains non\-singleton groups as prefix nodes, and assigns advantages to the spans induced by the retained nodes\. The only generation cost is the original $K$ responses; $B$ controls the resolution of credit assignment, not the rollout\.
LEAF replaces only the advantage estimator $A_{i,t}$\. The estimator is inspired by the process supervision of Hou et al\. \([2025](https://arxiv.org/html/2606.07610#bib.bib4)\), which assigns intermediate credit using descendant leaf values, but differs in how the tree is obtained and extends naturally to continuous rewards\. The resulting estimator is characterized in Section [4](https://arxiv.org/html/2606.07610#S4)\. Rather than branching from intermediate prefixes and decoding new continuations, LEAF samples the same $K$ complete responses used by GRPO and constructs a prefix tree *retrospectively* from shared prefixes within the sampled batch\. Thus, the rollout budget remains exactly $K$ complete responses\.
Let response $i$ be $y^{(i)}=(a_0^{(i)},a_1^{(i)},\ldots,a_{H_i-1}^{(i)})$, where $H_i$ is its length\. For a token boundary $p$, define the length\-$p$ prefix $y_{<p}^{(i)}=(a_0^{(i)},\ldots,a_{p-1}^{(i)})$\. A selected boundary $p$ is applied to all $K$ responses: responses are grouped together if and only if their token prefixes $y_{<p}^{(i)}$ are exactly identical\. Therefore, a selected boundary can produce zero, one, or multiple retained prefix nodes, depending on how many repeated prefix groups appear in the batch\. LEAF replaces the advantage estimation as follows\.
Step 1: Responses and surprisals\. We sample $K$ complete responses from $\pi_{\mathrm{old}}$ and compute their terminal rewards $r_1,\ldots,r_K$\. During generation, for each $i\in[K]$, we record the token surprisal $z_{i,t}=-\log\pi_{\mathrm{old}}(a_t^{(i)}\mid x,y_{<t}^{(i)})$ for $0\leq t<H_i$\. Surprisal is used only to decide where to inspect, without affecting the rollout distribution\.
Step 2: Fork boundaries\. We use high\-surprisal token positions to choose prefix boundaries\. Let $\ell_{\min}=\min_{i\in[K]}H_i$ and $\Delta=\max\{2,\lfloor\ell_{\min}/(B+1)\rfloor\}$\. The shortest length $\ell_{\min}$ restricts candidates to token boundaries that are present in every sampled response, so that the same boundary $p$ can be applied to all $K$ responses when forming exact prefix groups\. The separation parameter $\Delta$ prevents the fork budget from being spent on several nearby high\-surprisal tokens from the same local uncertainty region, encouraging the selected boundaries to cover distinct parts of the response\. We form the candidate list $\mathcal{C}=\{(z_{i,t},i,t): i\in[K],\ 1\leq t<\ell_{\min}\}$ and sort $\mathcal{C}$ in descending order of $z_{i,t}$\. We initialize $\mathcal{P}=\emptyset$ and scan this sorted list greedily\. When considering a candidate triple $(z_{i,t},i,t)$, we propose the boundary $p=t$, corresponding to the prefix immediately before the high\-surprisal token $a_t^{(i)}$\. The boundary $p$ is accepted if either $\mathcal{P}=\emptyset$ or $\min_{p'\in\mathcal{P}}|p-p'|\geq\Delta$\. If accepted, we update $\mathcal{P}\leftarrow\mathcal{P}\cup\{p\}$\. The scan stops once $|\mathcal{P}|=B$ or the candidate list is exhausted\. Finally, we sort the accepted boundaries and write $\mathcal{P}=\{p_1,\ldots,p_b\}$ with $p_1<\cdots<p_b$ and $b\leq B$\. Accepting a boundary $p$ does not select a single response\-specific prefix: the same boundary $p$ is applied to all $K$ sampled responses, and grouping is performed by exact equality of the token sequences $y_{<p}^{(i)}$\.
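The greedy scan of Step 2 can be sketched in a few lines of Python\. This is a minimal illustration, not the authors' implementation: the function name `select_boundaries`, the nested\-list layout of `surprisals`, and the tie\-breaking order \(stable sort, so ties keep response\-major order\) are assumptions\.

```python
def select_boundaries(surprisals, fork_budget):
    """Greedy boundary selection following Step 2.

    surprisals[i][t] is the surprisal z_{i,t} of token t in response i.
    Returns the sorted accepted boundaries and the separation Delta.
    """
    l_min = min(len(z) for z in surprisals)          # shortest response length
    delta = max(2, l_min // (fork_budget + 1))       # separation parameter
    # Candidates use 1 <= t < l_min so every response has a length-t prefix.
    candidates = [
        (z[t], i, t) for i, z in enumerate(surprisals) for t in range(1, l_min)
    ]
    candidates.sort(key=lambda c: c[0], reverse=True)  # descending surprisal

    selected = []
    for _, _, t in candidates:
        if len(selected) == fork_budget:             # budget B reached
            break
        # Accept if no boundary is selected yet or all are >= delta away.
        if all(abs(t - p) >= delta for p in selected):
            selected.append(t)
    return sorted(selected), delta


# Toy example with K = 2 responses of length 9 and B = 2.
surprisals = [
    [0.1, 0.2, 5.0, 0.1, 0.1, 0.1, 0.3, 4.0, 0.1],
    [0.1, 0.1, 0.1, 0.1, 4.5, 0.1, 0.1, 0.1, 0.2],
]
boundaries, delta = select_boundaries(surprisals, fork_budget=2)
```

In this toy input $\Delta=3$, so the candidate at $t=4$ is rejected for being too close to the accepted $t=2$, and the scan moves on to $t=7$\.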
Step 3: Retained prefix nodes\. For each selected boundary $p\in\mathcal{P}$, we compare the length\-$p$ token prefixes of all sampled responses\. For a token prefix $u$, define the descendant index set $I(p,u)=\{i\in[K]: y_{<p}^{(i)}=u\}$, with $n(p,u)=|I(p,u)|$\. The equality $y_{<p}^{(i)}=u$ is exact token\-prefix matching: two responses are grouped together only if their first $p$ tokens are identical\. If $n(p,u)\geq 2$, the pair $v=(p,u)$ is retained as a prefix node\. Singleton prefixes are discarded because they do not provide a reusable prefix\-level value estimate\. Let $\mathcal{N}_{\mathrm{fork}}$ denote the set of retained prefix nodes\. For $v=(p,u)$, write $p(v)=p$, $u(v)=u$, $I(v)=I(p,u)$, and $n(v)=n(p,u)$\. We also define a virtual root node $v_0$, corresponding to the empty prefix before any token is generated, with $p(v_0)=0$, $I(v_0)=[K]$, and $n(v_0)=K$\. Its value is the batch\-average reward $\widehat{V}(v_0)=\bar{r}$\. Each retained node $v$ receives the empirical prefix value $\widehat{V}(v)=\frac{1}{n(v)}\sum_{i\in I(v)}r_i$\. Intuitively, $\widehat{V}(v)$ estimates how good completions tend to be after passing through the exact token prefix represented by node $v$, while $\widehat{V}(v_0)$ is the prompt\-level average over all sampled completions\.
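The grouping in Step 3 amounts to hashing each length\-$p$ prefix and keeping the groups with at least two members\. The sketch below assumes responses are lists of integer token ids and that every boundary is at most the shortest response length; `retained_prefix_nodes` and its return layout are hypothetical names chosen for illustration\.

```python
from collections import defaultdict


def retained_prefix_nodes(responses, rewards, boundaries):
    """Map each retained node (p, u) to (descendant indices I(v), value V(v)).

    A node is retained only when at least two responses share the exact
    length-p token prefix u; singleton prefixes are discarded.
    """
    nodes = {}
    for p in boundaries:
        groups = defaultdict(list)
        for i, tokens in enumerate(responses):
            groups[tuple(tokens[:p])].append(i)      # exact prefix matching
        for prefix, indices in groups.items():
            if len(indices) >= 2:
                value = sum(rewards[i] for i in indices) / len(indices)
                nodes[(p, prefix)] = (indices, value)
    return nodes


# Toy batch: K = 4 responses of four tokens each.
responses = [[1, 2, 3, 4], [1, 2, 3, 5], [1, 2, 6, 7], [8, 2, 3, 4]]
rewards = [1.0, 0.0, 0.5, 0.2]
nodes = retained_prefix_nodes(responses, rewards, boundaries=[1, 3])
root_value = sum(rewards) / len(rewards)             # V(v_0), the batch mean
```

Here boundary $p=1$ retains one node shared by responses 0 to 2, boundary $p=3$ retains one node shared by responses 0 and 1, and response 3 never joins a retained node\.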
Step 4: Span\-level advantages\. For each response $i$, collect all retained nodes whose prefixes lie on the response: $\mathcal{N}_i=\{v=(p,u)\in\mathcal{N}_{\mathrm{fork}}: y_{<p}^{(i)}=u\}$\. Sort them by boundary position to obtain the path $v_1^{(i)},\ldots,v_{q_i}^{(i)}$, where $0<p(v_1^{(i)})<\cdots<p(v_{q_i}^{(i)})<H_i$, and $q_i$ is the number of retained prefix nodes on response $i$\. Set $v_0^{(i)}=v_0$ and $p(v_0^{(i)})=0$\. This path partitions response $i$ into spans between consecutive retained prefix nodes and a final tail span\. For an internal span ending at node $v_j^{(i)}$, LEAF uses the global\-plus\-local advantage $A_j^{(i)}=\left(\mathrm{GA}(v_j^{(i)})+\mathrm{LA}(v_j^{(i)})\right)/\sqrt{n(v_j^{(i)})}$, where $\mathrm{GA}(v_j^{(i)})=\widehat{V}(v_j^{(i)})-\widehat{V}(v_0)$ and $\mathrm{LA}(v_j^{(i)})=\widehat{V}(v_j^{(i)})-\widehat{V}(v_{j-1}^{(i)})$\. $\mathrm{GA}$ measures whether the prefix is better than the prompt\-level average, and $\mathrm{LA}$ measures whether the current span moved the response to a better prefix than the previous retained prefix on the same path\. The factor $1/\sqrt{n(v_j^{(i)})}$ is a multiplicity correction: if many sampled responses share the same prefix, the same token span appears many times in the batch, and the correction prevents that duplicated span from dominating the update solely because it is repeated\.
For each $j=1,\ldots,q_i$, every token $t\in[p(v_{j-1}^{(i)}),p(v_j^{(i)}))$ receives the same advantage $A_j^{(i)}$\. Thus, the advantage is not constant over the whole response; it is piecewise constant over spans between retained prefix nodes\. For the final tail span, we treat the terminal response as the leaf value\. Let $v_{\mathrm{last}}^{(i)}=v_{q_i}^{(i)}$ if $q_i>0$, and $v_{\mathrm{last}}^{(i)}=v_0$ otherwise\. We set $A_{\mathrm{tail}}^{(i)}=(r_i-\widehat{V}(v_0))+(r_i-\widehat{V}(v_{\mathrm{last}}^{(i)}))$\. Every token $t\in[p(v_{\mathrm{last}}^{(i)}),H_i)$ receives $A_{\mathrm{tail}}^{(i)}$\. If response $i$ has no retained prefix node, then the whole response is a single tail span and this gives $2(r_i-\widehat{V}(v_0))$, which has the same direction as the GRPO response\-level advantage before normalization\. In practice, the raw span advantages are normalized across the sampled batch before entering the loss in Eq\. \([1](https://arxiv.org/html/2606.07610#S3.E1)\)\.
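Putting Steps 3 and 4 together, the raw \(unnormalized\) per\-token advantages can be sketched as follows\. This is an illustrative reading of the formulas above rather than the authors' code: the function name, the list\-of\-token\-ids input, and the omission of the final batch normalization \(whose exact form is not given here\) are assumptions\.

```python
import math
from collections import defaultdict


def leaf_token_advantages(responses, rewards, boundaries):
    """Raw piecewise-constant advantages, one list of floats per response."""
    boundaries = sorted(boundaries)
    v_root = sum(rewards) / len(rewards)             # V(v_0): batch mean

    # Step 3: value and support of every retained (non-singleton) prefix node.
    stats = {}
    for p in boundaries:
        groups = defaultdict(list)
        for i, tokens in enumerate(responses):
            groups[tuple(tokens[:p])].append(i)
        for prefix, idx in groups.items():
            if len(idx) >= 2:
                stats[(p, prefix)] = (sum(rewards[i] for i in idx) / len(idx), len(idx))

    # Step 4: walk each response's path of retained nodes.
    all_advantages = []
    for i, tokens in enumerate(responses):
        adv = [0.0] * len(tokens)
        prev_p, prev_value = 0, v_root               # start at the virtual root
        for p in boundaries:
            node = stats.get((p, tuple(tokens[:p])))
            if node is None:                         # singleton prefix: skip
                continue
            value, n = node
            span_adv = ((value - v_root) + (value - prev_value)) / math.sqrt(n)
            for t in range(prev_p, p):               # tokens in [prev_p, p)
                adv[t] = span_adv
            prev_p, prev_value = p, value
        tail = (rewards[i] - v_root) + (rewards[i] - prev_value)
        for t in range(prev_p, len(tokens)):         # final tail span
            adv[t] = tail
        all_advantages.append(adv)
    return all_advantages


responses = [[1, 2, 3, 4], [1, 2, 3, 5], [1, 2, 6, 7], [8, 2, 3, 4]]
rewards = [1.0, 0.5, 0.0, 1.0]
adv = leaf_token_advantages(responses, rewards, boundaries=[1, 3])
```

In this toy batch, response 3 shares no prefix with the others, so all of its tokens receive $2(r_i-\widehat{V}(v_0))=0.75$, while response 0 gets three different span values\.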
## 4 Theoretical Insights
We provide theoretical insights into three design choices of LEAF: the span\-level advantage estimator, the fork budget $B$, and surprisal\-based boundary selection\. Specifically, we show that the empirical advantage estimator that LEAF uses preserves the correct update direction, we characterize the tradeoff governing the choice of $B$, and we highlight the issues with pure prefix matching\. Due to limited space, we state concise informal versions of the results here and defer the formal statements and proofs to Appendix [E](https://arxiv.org/html/2606.07610#A5)\.
###### Theorem 1 \(Validity of span\-level advantages \(informal\)\)\.
For any span ending at prefix $U_j$, the root\-relative difference $V(U_j)-V(U_0)$ and the parent\-relative difference $V(U_j)-V(U_{j-1})$ are both valid scalar advantages\. Under fixed\-prefix conditional sampling, $\widehat{V}(v)$ is an unbiased estimate of $V(v)$ with variance scaling as $1/n(v)$\. Thus, shared\-prefix averaging denoises the value before LEAF assigns span\-level credit\.
###### Theorem 2 \(Fork budget tradeoff \(informal\)\)\.
The fork budget $B$ increases the number of selected boundaries at which prefix values may be estimated\. With rollout budget $K$, at most $B\lfloor K/2\rfloor$ non\-root prefix nodes can be retained, since each boundary yields at most $\lfloor K/2\rfloor$ non\-singleton groups\. In addition, increasing $B$ gives finer credit assignment but can create more low\-support value estimates\. With $K=8$, each boundary can retain at most four nodes; using $B=2$ keeps the worst\-case number of non\-root prefix values at $8$ while still allowing internal span\-level credit\.
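The counting bound in Theorem 2 is simple enough to check numerically\. The helper names below are hypothetical and the sketch only restates the arithmetic: a boundary retains one node per prefix group of size at least two, and $K$ responses can form at most $\lfloor K/2\rfloor$ such groups\.

```python
def max_retained_nodes(num_rollouts, fork_budget):
    """Worst-case number of non-root retained nodes: B * floor(K / 2)."""
    return fork_budget * (num_rollouts // 2)


def retained_at_boundary(group_sizes):
    """Retained nodes at one boundary: prefix groups with >= 2 members."""
    return sum(1 for size in group_sizes if size >= 2)


K, B = 8, 2
worst_case_split = [2, 2, 2, 2]      # K = 8 responses split into four pairs
uneven_split = [5, 1, 1, 1]          # one large group plus three singletons
bound = max_retained_nodes(K, B)
```

With the setting used in the experiments \($K=8$, $B=2$\) the bound is 8, attained only if both boundaries split the batch into four pairs\.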
###### Theorem 3 \(Pure prefix matching is shallow\-biased \(informal\)\)\.
Prefix\-collision mass is monotone non\-increasing with boundary depth, so a selector that maximizes collisions alone selects from the earliest high\-collision plateau\. Hence, pure prefix matching can spend the fork budget on shallow boilerplate prefixes\. LEAF separates the two roles: surprisal selects where to inspect, while exact prefix matching decides which prefixes at that boundary have enough support to retain\.
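The monotonicity claim can be illustrated with a simple proxy for collision mass: the number of response pairs that share an identical length\-$p$ prefix\. This proxy and the function names are assumptions for illustration; the formal definition is in the appendix and may differ\. Two responses that agree on their first $p+1$ tokens also agree on their first $p$, so the pair count cannot grow with depth\.

```python
from collections import Counter


def collision_pairs(responses, p):
    """Number of unordered response pairs with identical length-p prefixes."""
    counts = Counter(tuple(tokens[:p]) for tokens in responses)
    return sum(c * (c - 1) // 2 for c in counts.values())


def collision_profile(responses):
    """Collision pair counts at boundaries p = 1, ..., l_min - 1."""
    l_min = min(len(tokens) for tokens in responses)
    return [collision_pairs(responses, p) for p in range(1, l_min)]


responses = [[1, 2, 3, 4], [1, 2, 3, 5], [1, 2, 6, 7], [8, 2, 3, 4]]
profile = collision_profile(responses)               # one count per boundary
# A collision-only selector picks the earliest boundary with maximal count.
shallowest_best = 1 + profile.index(max(profile))
```

On this toy batch the profile is non\-increasing and the collision\-only choice is the first boundary, which is the shallow bias the theorem describes\.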
## 5 Experimental Setup
Datasets\. We evaluate LEAF on four speech\-language benchmarks covering short\-form SQA, long\-form SQA, and AST\. For short\-form SQA, we use Part I of LibriSQA Zhao et al\. \([2025](https://arxiv.org/html/2606.07610#bib.bib30)\), an open\-ended QA benchmark over LibriSpeech audio Panayotov et al\. \([2015](https://arxiv.org/html/2606.07610#bib.bib21)\)\. For long\-form SQA, we use two LongAudio\-derived settings Ghosh et al\. \([2025](https://arxiv.org/html/2606.07610#bib.bib19)\): DailyTalk Lee et al\. \([2023](https://arxiv.org/html/2606.07610#bib.bib22)\), after removing multiple\-choice examples, and a Europarl/VoxPopuli split Koehn \([2005](https://arxiv.org/html/2606.07610#bib.bib23)\); Wang et al\. \([2021](https://arxiv.org/html/2606.07610#bib.bib20)\), which we refer to as LongAudio\. We restrict both long\-form settings to audio segments of at most $40$ seconds\. For AST, we use CoVoST2 English\-to\-German Wang et al\. \([2020](https://arxiv.org/html/2606.07610#bib.bib24)\)\. We use dataset\-provided instructions when available and only adapt input formatting to each SALLM; split details are in Appendix [B\.2](https://arxiv.org/html/2606.07610#A2.SS2)\.
Baselines and backbones\. We compare LEAF with the state\-of\-the\-art GRPO speech\-RL baseline of Elmakies et al\. \([2026](https://arxiv.org/html/2606.07610#bib.bib1)\)\. Their reported results use full\-parameter tuning; our controlled LEAF\-vs\.\-GRPO experiments use LoRA for both methods to isolate the RL objective under the same trainable\-parameter budget\. Due to compute constraints, we do not replicate their full\-parameter GRPO baseline\. Moreover, Elmakies et al\. \([2026](https://arxiv.org/html/2606.07610#bib.bib1)\) report the rollout budget but not other training details, such as the number of epochs or learning rate, which makes an exact replication difficult\. When direct comparison to their reported numbers is possible, we include it; otherwise, we use our LoRA GRPO implementation\. For rollouts, we match their reported budget with $K=8$ responses per prompt\. LEAF uses the same rollout budget with fork budget $B=2$; all other optimization and decoding hyperparameters are matched between LEAF and GRPO and reported in Appendix [B\.1](https://arxiv.org/html/2606.07610#A2.SS1)\. We evaluate Granite Speech 3\.3 2B/8B Saon et al\. \([2025](https://arxiv.org/html/2606.07610#bib.bib25)\), Granite\-4\.0\-1B\-Speech Team \([2026](https://arxiv.org/html/2606.07610#bib.bib26)\), and Qwen2\-Audio\-7B Chu et al\. \([2024](https://arxiv.org/html/2606.07610#bib.bib27)\)\.
Evaluation metrics\. To ensure fair comparison with the existing state\-of\-the\-art Elmakies et al\. \([2026](https://arxiv.org/html/2606.07610#bib.bib1)\), we report BLEU, ROUGE\-1/2/L, METEOR, and BERTScore\-F1 using temperature $0.9$ and top\-$p=0.9$ decoding\. To complement automatic metrics, we use M\-Prometheus\-14B Pombal et al\. \([2025](https://arxiv.org/html/2606.07610#bib.bib28)\) for rubric\-grounded Likert\-5 scoring and GEMBA\-style DA\-100 Kocmi and Federmann \([2023](https://arxiv.org/html/2606.07610#bib.bib29)\) with Qwen2\.5\-14B\-Instruct Qwen et al\. \([2025](https://arxiv.org/html/2606.07610#bib.bib31)\) for continuous per\-item scores\. \(Footnote 1: We exclude Qwen2\-Audio\-7B from DA\-100 evaluation to avoid family bias\.\) DA\-100 scores are averaged over $N=5$ judgments at $T=0.7$ and used for tail\-risk analysis\. We also perform matched A/B comparisons using JudgeLM\-13B Zhu et al\. \([2025](https://arxiv.org/html/2606.07610#bib.bib32)\): each anonymized pair is judged in both orders, ties are allowed, and only order\-consistent verdicts are counted to reduce position bias\. Pairwise comparisons use greedy decoding\.
Statistical analysis\. We report $95\%$ confidence intervals for judge means, $\mathrm{CVaR}@\alpha$, and threshold fractions using $B=1000$ non\-parametric bootstrap resamples \(here $B$ denotes the number of resamples, not the fork budget\)\. Tail risk is measured by $\mathrm{CVaR}@\alpha$, the mean score over the worst $\alpha$ fraction of examples Rockafellar et al\. \([2000](https://arxiv.org/html/2606.07610#bib.bib33)\)\. Because Likert\-5 $\mathrm{CVaR}@10$ saturates at the minimum score whenever at least $10\%$ of examples receive score $1$, we use continuous GEMBA\-DA scores for the main tail\-risk analysis\.
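A minimal sketch of the tail\-risk statistic and its bootstrap interval is given below\. It assumes that higher scores are better \(so the "worst" fraction is the lowest scores\), rounds the tail size up to at least one example, and uses a percentile bootstrap; the text only states that the bootstrap is non\-parametric, so these choices and the function names are illustrative assumptions\.

```python
import math
import random


def cvar(scores, alpha):
    """Mean of the worst (lowest) alpha fraction of scores."""
    k = max(1, math.ceil(alpha * len(scores) - 1e-9))  # tail size, >= 1
    return sum(sorted(scores)[:k]) / k


def bootstrap_ci(scores, stat, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap CI for stat(scores), resampling examples."""
    rng = random.Random(seed)
    n = len(scores)
    values = sorted(
        stat([scores[rng.randrange(n)] for _ in range(n)]) for _ in range(n_boot)
    )
    lo = values[int((1 - level) / 2 * n_boot)]
    hi = values[min(n_boot - 1, int((1 + level) / 2 * n_boot))]
    return lo, hi


da_scores = [50.0, 90.0, 10.0, 70.0, 30.0]           # toy DA-100 style scores
tail_mean = cvar(da_scores, alpha=0.4)                # mean of the two lowest
ci_low, ci_high = bootstrap_ci(da_scores, lambda s: cvar(s, 0.4), n_boot=200)

# Likert saturation: if >= 10% of items score 1, CVaR@10 is pinned at 1.
likert = [1, 1] + [5] * 8
likert_cvar10 = cvar(likert, alpha=0.10)
```

The last two lines reproduce the saturation effect that motivates using continuous DA scores for the tail\-risk analysis\.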
## 6 When Does LEAF Work?
Before presenting the main results, we ask whether LEAF has the raw structure it needs to operate\. It relies on an empirical property of i\.i\.d\. rollout groups: sampled responses must share exact prefixes often enough, and at sufficiently deep token positions, for the induced virtual forks to provide a useful credit\-assignment signal\.
We call this property *forkability*\. To define it, consider $K$ i\.i\.d\. rollouts for a fixed prompt\. For a selected boundary $t$, group responses by exact equality of their length\-$t$ prefixes\. A *fork\-fire* occurs at $t$ if at least one prefix group has size at least two\. A fork\-fire is *usable* if at least one such non\-singleton group has non\-trivial reward variation, e\.g\., $\max_{i,j\in G}|r_i-r_j|>\tau_r$ for some retained prefix group $G$ and tolerance $\tau_r\geq 0$\. Thus, usable fork\-fires correspond to shared prefixes whose descendant responses lead to distinguishable outcomes, rather than merely producing duplicate completions\. The usable fork\-fire rate $r_f$ is the fraction of LEAF\-selected boundary attempts that are usable\. We measure forkability directly across four speech\-understanding datasets and three SALLM backbones, a $3\times 4$ design with $\approx 36{,}000$ measured fork attempts\. The usable fork\-fire rate is $\geq 72.9\%$ in every one of the twelve $(\text{backbone},\text{task})$ cells, and pooled rates differ by $\leq 1.4$ percentage points across backbones, indicating that forkability is a property of speech\-understanding tasks rather than of the model\.
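The two definitions translate directly into a small check per boundary attempt\. The sketch below uses hypothetical names \(`fork_fire`, `usable_rate`\) and takes the reward spread of a group as its maximum minus its minimum, which equals $\max_{i,j\in G}|r_i-r_j|$\.

```python
from collections import defaultdict


def fork_fire(responses, rewards, boundary, tau_r=0.0):
    """Return (fired, usable) for one boundary attempt on one prompt."""
    groups = defaultdict(list)
    for tokens, reward in zip(responses, rewards):
        groups[tuple(tokens[:boundary])].append(reward)
    shared = [g for g in groups.values() if len(g) >= 2]   # non-singletons
    fired = bool(shared)
    usable = any(max(g) - min(g) > tau_r for g in shared)
    return fired, usable


def usable_rate(attempts):
    """Fraction of (fired, usable) attempts that are usable."""
    return sum(1 for _, usable in attempts if usable) / len(attempts)


responses = [[1, 2, 3, 4], [1, 2, 3, 5], [1, 2, 6, 7], [8, 2, 3, 4]]
attempts = [
    fork_fire(responses, [1.0, 0.5, 0.0, 1.0], boundary=3),  # shared, rewards differ
    fork_fire(responses, [1.0, 1.0, 0.0, 1.0], boundary=3),  # shared, rewards equal
    fork_fire(responses, [1.0, 0.5, 0.0, 1.0], boundary=4),  # no shared prefix
]
rate = usable_rate(attempts)
```

Only the first attempt is usable: the second fires but its shared group has identical rewards, and the third does not fire at all\.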
Direct measurement. For each of the twelve cells, we run LEAF on a held-out validation+test split of $N = 3000$–$3400$ examples. Table [1](https://arxiv.org/html/2606.07610#S6.T1) reports the pooled rates with $95\%$ per-position bootstrap CIs. The pooled-by-backbone rates differ by at most $1.4$ pp, recovering backbone-invariance. The pooled-by-task rates span $20$ pp and follow task difficulty (easier QA tasks fork more reliably than translation/long-form transcription). Across all twelve cells, $r_f$ never drops below $72.9\%$; the per-cell breakdown (Tab. [8](https://arxiv.org/html/2606.07610#A3.T8)) is in the appendix. Figure [2](https://arxiv.org/html/2606.07610#S6.F2) plots usable rate vs. token position; for each task, the three backbone curves overlap within their $95\%$ CIs at every position, giving a per-position analogue of the pooled claim.
Table 1: Usable fork-fire rate $r_f$, pooled by backbone (top) and by task (middle). Confidence intervals are per-position non-parametric bootstrap ($n_{\text{boot}} = 5000$); positions are the unit of resampling because rollouts within an example are correlated through the sampler. Pooled-by-backbone rates lie within $1.4$ pp of each other.

Figure 2: Usable fork-fire rate vs. token position, one panel per task, three backbones overlaid. Shaded bands are pointwise $95\%$ CIs; positions with fewer than $15$ attempts are dropped to suppress small-$n$ noise; the $x$-axis is clipped to position $40$. Backbones overlap within their CIs in every panel, supporting backbone-invariance.

Forks branch into distinct outcomes and reach into the response. A high rate alone is not enough: fires must produce non-singleton groups with discriminating rewards, and must occur deep enough into the response to discriminate body tokens rather than just the opening. Per-cell statistics in the appendix (Tab. [9](https://arxiv.org/html/2606.07610#A3.T9), [10](https://arxiv.org/html/2606.07610#A3.T10)) confirm both: mean group sizes lie in $4.0$–$6.2$ of the $K = 8$ budget, within-group BLEU spread is strictly positive in every cell, and the $90\%$-cumulative depth $P_{90}$ ranges from $14$ to $39$ token positions across cells. Forks therefore reach into the body of the response, particularly on long-form transcription, where $P_{90} \geq 35$ across all three backbones.
## 7 Results

Table 2: End-to-end results across datasets, backbones, and training methods. Each row is the best-BLEU checkpoint per setting. M-Prometheus Likert is a 1–5 rubric-grounded judge; GEMBA-DA is a 0–100 continuous score.

Table 3: Tail-risk analysis on per-item GEMBA-DA continuous scores. CVaR@$\alpha$ is the mean of the worst $\alpha$ fraction of items; higher is better. Bootstrap 95% CIs use $B = 1000$ resamples.

Table 4: Pairwise A/B judgement of LEAF vs. GRPO with position-swap consistency aggregation. ‘consist.’ is the fraction of order-consistent verdicts; net adv. $=$ LEAF win-rate $-$ GRPO win-rate on consistent pairs.

We organize the results around two comparisons. First, we compare LEAF against the published speech-RL state of the art on LibriSQA, where the evaluation setup is closely reproducible. Second, we compare LEAF against our own LoRA GRPO baseline across all datasets, which controls for prompts, decoding, rollout budget, trainable-parameter budget, and evaluation pipeline.
### 7.1 Comparison against the state-of-the-art
Table [2](https://arxiv.org/html/2606.07610#S7.T2) includes the published full-parameter GRPO results of Elmakies et al. ([2026](https://arxiv.org/html/2606.07610#bib.bib1)) on LibriSQA. Despite using only LoRA adapters, LEAF improves over their full-parameter GRPO numbers on the matched Granite Speech backbones. On Granite 3.3 2B, LEAF improves over full-parameter GRPO while training only $133.7$M parameters instead of $3.01$B, a $22.5\times$ reduction in trainable parameters. On Granite 3.3 8B, LEAF trains $218.9$M parameters instead of $8.65$B, a $39.5\times$ reduction, while again improving the reported LibriSQA performance. The smaller Granite-4.0-1B-Speech model trained with LEAF uses only $102.2$M trainable parameters and approaches the published full-parameter Granite 3.3 2B GRPO result, further suggesting that improved credit assignment can compensate for a substantially smaller trainable-parameter budget.
We restrict cross-paper comparison to LibriSQA. For CoVoST2, the frozen-base Granite 3.3 2B/8B results reported by Elmakies et al. ([2026](https://arxiv.org/html/2606.07610#bib.bib1)) are substantially higher than the frozen-base performance obtained under our prompts and input formatting. This mismatch suggests that the CoVoST2 setups differ in prompt template or input formatting, not only in the RL algorithm. Since all prompts used in our experiments are the default ones provided in the datasets, we treat our LoRA GRPO implementation as the controlled CoVoST2 baseline and do not include the published CoVoST2 numbers.
### 7.2 Controlled comparison against GRPO
Table [2](https://arxiv.org/html/2606.07610#S7.T2) gives the controlled comparison between LEAF and LoRA GRPO across all datasets and backbones. In every matched setting, LEAF improves over LoRA GRPO on BLEU and on the judge-based quality scores. The BLEU gains are $+2.80$ and $+1.92$ on CoVoST2 for Granite 3.3 2B and 8B, $+5.05$ on DailyTalk, $+5.09$ on LibriSQA with Granite 3.3 2B, $+4.90$ on LibriSQA with Qwen2-Audio, and $+8.68$ on LongAudio. The gains are largest on the long-form QA settings, where assigning one terminal-reward advantage to every token is most likely to dilute span-level signals.
The judge-based metrics show the same trend. On GEMBA-DA, LEAF improves over LoRA GRPO in every matched cell: $+4.15$ and $+1.96$ on CoVoST2, $+6.37$ on DailyTalk, $+1.75$ and $+2.16$ on LibriSQA, and $+3.10$ on LongAudio. LEAF also improves Likert scores in every matched cell. Thus, the gains are not limited to n-gram overlap metrics, but persist under rubric-grounded and continuous LLM-as-judge evaluation.
Tail-risk. Table [3](https://arxiv.org/html/2606.07610#S7.T3) shows that LEAF improves worst-case behavior on top of average quality. On LibriSQA, it raises CVaR@10 from $24.27$ to $32.04$ for Granite 3.3 2B and from $19.06$ to $29.31$ for Qwen2-Audio. On DailyTalk and LongAudio, LEAF improves CVaR@10 by $4.94$ and $7.73$, respectively. On CoVoST2, it improves CVaR@25 and reduces the fraction of examples below $50$ in both Granite settings. This suggests that LEAF reduces catastrophic failures, consistent with its ability to localize negative credit to harmful spans rather than broadcasting it across the full response.
Pairwise preference. Table [4](https://arxiv.org/html/2606.07610#S7.T4) reports direct A/B comparisons between LEAF and GRPO on matched examples. LEAF has a positive net win advantage in every evaluated setting: $+7.6$ points on LibriSQA with Granite 3.3 2B, $+3.9$ with Qwen2-Audio, $+11.4$ on DailyTalk, and $+21.3$ on LongAudio under JudgeLM. On CoVoST2, M-Prometheus relative grading gives LEAF positive net advantages of $+17.7$ and $+14.0$ points for Granite 3.3 2B and 8B. These pairwise results confirm that the improvements in judge metrics correspond to direct output-level preferences.
Length efficiency. LEAF also produces more compact outputs than LoRA GRPO. In every matched QA setting, LEAF uses fewer predicted tokens: $19.7 \to 18.5$ on LibriSQA with Granite 3.3 2B, $20.1 \to 18.8$ with Qwen2-Audio, $19.1 \to 15.3$ on DailyTalk, and $40.5 \to 30.2$ on LongAudio. The effect is strongest for long-form audio, where GRPO tends to over-generate. This supports the intended mechanism of LEAF: by assigning separate credit to tail spans, the method can penalize unnecessary continuations without diluting the signal across the entire response.
## 8 Conclusion
In this work, we introduced LEAF, a retrospective tree\-based RL method for SALLM post\-training\. LEAF reuses the same i\.i\.d\. rollout groups collected by GRPO, detects shared prefixes at high\-surprisal boundaries, and converts descendant rewards into span\-level advantages\. This yields process\-aware credit assignment without extra decoding, online tree search, or a separate process reward model\. Across speech QA and translation tasks, LEAF consistently improves over LoRA GRPO under the same rollout and trainable\-parameter budget, while also improving tail behavior, pairwise preferences, and length efficiency\. On LibriSQA, LEAF further improves over published full\-parameter GRPO while training only LoRA adapters\. Future work includes richer reward models, multilingual speech tasks, and adaptive boundary selection\.
## Limitations and Societal Impact
Limitations. We note several limitations. First, LEAF’s effectiveness is regime-dependent: its mechanism requires that i.i.d. rollouts share prefixes often enough to populate non-trivial fork groups, and the absolute fork rate depends on the policy. We therefore state our claims comparatively (backbone-invariance, task-monotonicity) and as a lower bound ($r_f \geq 72.9\%$ in every cell) rather than as an exact rate. Second, our reward signal is corpus BLEU, following prior speech-RL work; smoother rewards may yield further gains we do not measure. Third, hardware constrained us to three NVIDIA A40 GPUs (48 GB) with a single training seed per (model, method): we report decoding-seed standard deviations where available but do not characterize training-seed variance. Finally, all tasks are English speech understanding, so our findings may not transfer to other languages or modalities.
Societal Impact. LEAF targets multimodal (speech) models, whose input processing carries higher compute and energy cost per example than text-only systems; methods that add rollouts or tree expansion increase this cost further. As with any work improving speech understanding, downstream deployment raises questions of reliability and safety, since transcription or understanding errors can propagate into user-facing decisions, and of attribution, as it becomes harder to trace whether an output reflects the source audio faithfully.
## Acknowledgements
AG gratefully acknowledges Moulik Choraria for introducing him to tree\-based methods for post\-training large language models\.
## References
- S. Arora, K. Chang, C. Chien, Y. Peng, H. Wu, Y. Adi, E. Dupoux, H. Lee, K. Livescu, and S. Watanabe (2025) On the landscape of spoken language models: a comprehensive survey. Transactions on Machine Learning Research. External Links: ISSN 2835-8856, [Link](https://openreview.net/forum?id=BvxaP3sVbA). Cited by: [§1](https://arxiv.org/html/2606.07610#S1.p1.1).
- Y. Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo, Y. Leng, Y. Lv, J. He, J. Lin, C. Zhou, and J. Zhou (2024) Qwen2-audio technical report. External Links: 2407.10759, [Link](https://arxiv.org/abs/2407.10759). Cited by: [§5](https://arxiv.org/html/2606.07610#S5.p2.2).
- A. Elmakies, H. Aronowitz, N. Shabtay, E. Schwartz, R. Hoory, and A. Dekel (2026) Advancing speech understanding in speech-aware language models with GRPO. In ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 17797–17801. External Links: [Document](https://dx.doi.org/10.1109/ICASSP55912.2026.11463105). Cited by: [§2](https://arxiv.org/html/2606.07610#S2.p1.1), [§5](https://arxiv.org/html/2606.07610#S5.p2.2), [§5](https://arxiv.org/html/2606.07610#S5.p3.4), [§7.1](https://arxiv.org/html/2606.07610#S7.SS1.p1.7), [§7.1](https://arxiv.org/html/2606.07610#S7.SS1.p2.1).
- S. Ghosh, A. Goel, J. Kim, S. Kumar, Z. Kong, S. Lee, C. Yang, R. Duraiswami, D. Manocha, R. Valle, and B. Catanzaro (2025) Audio flamingo 3: advancing audio intelligence with fully open large audio language models. In Advances in Neural Information Processing Systems, D. Belgrave, C. Zhang, H. Lin, R. Pascanu, P. Koniusz, M. Ghassemi, and N. Chen (Eds.), Vol. 38, pp. 41819–41886. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2025/file/3babb6b453cb59d87cb58a1219ef914b-Paper-Conference.pdf). Cited by: [§B.2](https://arxiv.org/html/2606.07610#A2.SS2.p1.1), [§5](https://arxiv.org/html/2606.07610#S5.p1.1).
- Y. Guo, L. Xu, J. Liu, Y. Dan, and S. Qiu (2026) Segment policy optimization: effective segment-level credit assignment in RL for large language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. External Links: [Link](https://openreview.net/forum?id=9osvTOYbT4). Cited by: [§2](https://arxiv.org/html/2606.07610#S2.p2.1).
- Z. Hou, Z. Hu, Y. Li, R. Lu, J. Tang, and Y. Dong (2025) TreeRL: LLM reinforcement learning with on-policy tree search. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar (Eds.), Vienna, Austria, pp. 12355–12369. External Links: [Link](https://aclanthology.org/2025.acl-long.604/), [Document](https://dx.doi.org/10.18653/v1/2025.acl-long.604), ISBN 979-8-89176-251-0. Cited by: [§2](https://arxiv.org/html/2606.07610#S2.p2.1), [§3.2](https://arxiv.org/html/2606.07610#S3.SS2.p2.3).
- E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen (2022) LoRA: low-rank adaptation of large language models. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=nZeVKeeFYf9). Cited by: [§1](https://arxiv.org/html/2606.07610#S1.p4.1).
- Y. Ji, Z. Ma, Y. Wang, G. Chen, X. Chu, and L. Wu (2026) Tree search for LLM agent reinforcement learning. In The Fourteenth International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=ZpQwAFhU13). Cited by: [§2](https://arxiv.org/html/2606.07610#S2.p2.1).
- A. Kazemnejad, M. Aghajohari, E. Portelance, A. Sordoni, S. Reddy, A. Courville, and N. L. Roux (2025) VinePPO: refining credit assignment in RL training of LLMs. In Forty-second International Conference on Machine Learning. External Links: [Link](https://openreview.net/forum?id=Myx2kJFzAn). Cited by: [§2](https://arxiv.org/html/2606.07610#S2.p2.1).
- T. Kocmi and C. Federmann (2023) Large language models are state-of-the-art evaluators of translation quality. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, M. Nurminen, J. Brenner, M. Koponen, S. Latomaa, M. Mikhailov, F. Schierl, T. Ranasinghe, E. Vanmassenhove, S. A. Vidal, N. Aranberri, M. Nunziatini, C. P. Escartín, M. Forcada, M. Popovic, C. Scarton, and H. Moniz (Eds.), Tampere, Finland, pp. 193–203. External Links: [Link](https://aclanthology.org/2023.eamt-1.19/). Cited by: [§5](https://arxiv.org/html/2606.07610#S5.p3.4).
- P. Koehn (2005) Europarl: a parallel corpus for statistical machine translation. In Proceedings of Machine Translation Summit X: Papers, Phuket, Thailand, pp. 79–86. External Links: [Link](https://aclanthology.org/2005.mtsummit-papers.11/). Cited by: [§5](https://arxiv.org/html/2606.07610#S5.p1.1).
- K. Lee, K. Park, and D. Kim (2023) DailyTalk: spoken dialogue dataset for conversational text-to-speech. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. External Links: [Document](https://dx.doi.org/10.1109/ICASSP49357.2023.10095751). Cited by: [§B.2](https://arxiv.org/html/2606.07610#A2.SS2.p1.1), [§5](https://arxiv.org/html/2606.07610#S5.p1.1).
- Y. Li, Q. Gu, Z. Wen, Z. Li, T. Xing, S. Guo, T. Zheng, X. Zhou, X. Qu, W. Zhou, Z. Zhang, W. Shen, Q. Liu, C. Lin, J. Yang, G. Zhang, and W. Huang (2025) TreePO: bridging the gap of policy optimization and efficacy and inference efficiency with heuristic tree-based modeling. External Links: 2508.17445, [Link](https://arxiv.org/abs/2508.17445). Cited by: [§2](https://arxiv.org/html/2606.07610#S2.p2.1).
- C. Liu, Y. Hu, Y. Gao, S. Zhang, and Z. Ling (2025) Group relative policy optimization for text-to-speech with large language models. External Links: 2509.18798, [Link](https://arxiv.org/abs/2509.18798). Cited by: [§2](https://arxiv.org/html/2606.07610#S2.p1.1).
- V. Panayotov, G. Chen, D. Povey, and S. Khudanpur (2015) Librispeech: an ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210. External Links: [Document](https://dx.doi.org/10.1109/ICASSP.2015.7178964). Cited by: [§5](https://arxiv.org/html/2606.07610#S5.p1.1).
- J. Pombal, D. Yoon, P. Fernandes, I. Wu, S. Kim, R. Rei, G. Neubig, and A. Martins (2025) M-prometheus: a suite of open multilingual LLM judges. In Second Conference on Language Modeling. External Links: [Link](https://openreview.net/forum?id=Atyk8lnIQQ). Cited by: [§5](https://arxiv.org/html/2606.07610#S5.p3.4).
- Qwen, A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Zhang, and Z. Qiu (2025) Qwen2.5 technical report. External Links: 2412.15115, [Link](https://arxiv.org/abs/2412.15115). Cited by: [§5](https://arxiv.org/html/2606.07610#S5.p3.4).
- R. T. Rockafellar, S. Uryasev, et al. (2000) Optimization of conditional value-at-risk. Journal of Risk 2, pp. 21–42. Cited by: [§5](https://arxiv.org/html/2606.07610#S5.p4.8).
- G. Saon, A. Dekel, A. Brooks, T. Nagano, A. Daniels, A. Satt, A. Mittal, B. Kingsbury, D. Haws, E. Morais, G. Kurata, H. Aronowitz, I. Ibrahim, J. Kuo, K. Soule, L. Lastras, M. Suzuki, R. Hoory, S. Thomas, S. Novitasari, T. Fukuda, V. Sunder, X. Cui, and Z. Kons (2025) Granite-speech: open-source speech-aware LLMs with strong English ASR capabilities. External Links: 2505.08699, [Link](https://arxiv.org/abs/2505.08699). Cited by: [§5](https://arxiv.org/html/2606.07610#S5.p2.2).
- J. Schulman and Thinking Machines Lab (2025) LoRA without regret. Thinking Machines Lab: Connectionism. Note: https://thinkingmachines.ai/blog/lora/. External Links: [Document](https://dx.doi.org/10.64434/tml.20250929). Cited by: [§1](https://arxiv.org/html/2606.07610#S1.p4.1).
- Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. K. Li, Y. Wu, and D. Guo (2024) DeepSeekMath: pushing the limits of mathematical reasoning in open language models. External Links: 2402.03300, [Link](https://arxiv.org/abs/2402.03300). Cited by: [§1](https://arxiv.org/html/2606.07610#S1.p1.1), [§2](https://arxiv.org/html/2606.07610#S2.p1.1).
- P. G. Shivakumar, Y. Gu, A. Gandhe, and I. Bulyko (2025) Group relative policy optimization for speech recognition. External Links: 2509.01939, [Link](https://arxiv.org/abs/2509.01939). Cited by: [§2](https://arxiv.org/html/2606.07610#S2.p1.1).
- I. G. S. Team (2026) Granite 4.0 speech. External Links: [Link](https://huggingface.co/ibm-granite/granite-4.0-1b-speech). Cited by: [§5](https://arxiv.org/html/2606.07610#S5.p2.2).
- H. Tran, Z. Yao, and H. Yu (2025) Exploiting tree structure for credit assignment in RL training of LLMs. External Links: 2509.18314, [Link](https://arxiv.org/abs/2509.18314). Cited by: [§2](https://arxiv.org/html/2606.07610#S2.p2.1).
- C. Wang, M. Riviere, A. Lee, A. Wu, C. Talnikar, D. Haziza, M. Williamson, J. Pino, and E. Dupoux (2021) VoxPopuli: a large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), C. Zong, F. Xia, W. Li, and R. Navigli (Eds.), Online, pp. 993–1003. External Links: [Link](https://aclanthology.org/2021.acl-long.80/), [Document](https://dx.doi.org/10.18653/v1/2021.acl-long.80). Cited by: [§5](https://arxiv.org/html/2606.07610#S5.p1.1).
- C. Wang, A. Wu, and J. Pino (2020) CoVoST 2 and massively multilingual speech-to-text translation. External Links: 2007.10310, [Link](https://arxiv.org/abs/2007.10310). Cited by: [§B.2](https://arxiv.org/html/2606.07610#A2.SS2.p1.1), [§5](https://arxiv.org/html/2606.07610#S5.p1.1).
- S. Wang, L. Yu, C. Gao, C. Zheng, S. Liu, R. Lu, K. Dang, X. Chen, J. Yang, Z. Zhang, Y. Liu, A. Yang, A. Zhao, Y. Yue, S. Song, B. Yu, G. Huang, and J. Lin (2026) Beyond the 80/20 rule: high-entropy minority tokens drive effective reinforcement learning for LLM reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. External Links: [Link](https://openreview.net/forum?id=yfcpdY4gMP). Cited by: [§2](https://arxiv.org/html/2606.07610#S2.p2.1).
- Z. Yang, Z. Guo, Y. Huang, X. Liang, Y. Wang, and J. Tang (2025) TreeRPO: tree relative policy optimization. External Links: 2506.05183, [Link](https://arxiv.org/abs/2506.05183). Cited by: [§2](https://arxiv.org/html/2606.07610#S2.p2.1).
- Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, L. Liu, X. Liu, H. Lin, Z. Lin, B. Ma, G. Sheng, Y. Tong, C. Zhang, M. Zhang, W. Zhang, H. Zhu, J. Zhu, J. Chen, J. Chen, C. Wang, H. Yu, Y. Song, X. Wei, H. Zhou, J. Liu, W. Ma, Y. Zhang, L. Yan, M. Qiao, Y. Wu, and M. Wang (2025) DAPO: an open-source LLM reinforcement learning system at scale. External Links: 2503.14476, [Link](https://arxiv.org/abs/2503.14476). Cited by: [§1](https://arxiv.org/html/2606.07610#S1.p1.1).
- Z. Zhao, Y. Jiang, H. Liu, Y. Wang, and Y. Wang (2025) LibriSQA: a novel dataset and framework for spoken question answering with large language models. IEEE Transactions on Artificial Intelligence 6 (11), pp. 2884–2895. External Links: [Document](https://dx.doi.org/10.1109/TAI.2024.3427069). Cited by: [§B.2](https://arxiv.org/html/2606.07610#A2.SS2.p1.1), [§5](https://arxiv.org/html/2606.07610#S5.p1.1).
- L. Zhu, X. Wang, and X. Wang (2025) JudgeLM: fine-tuned large language models are scalable judges. In The Thirteenth International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=xsELpEPn4A). Cited by: [§5](https://arxiv.org/html/2606.07610#S5.p3.4).
## Appendix A Hardware details
All experiments were run using three NVIDIA A40 GPUs \(48 GB\)\.
## Appendix B Experimental Details
### B.1 Training Parameters
Table 5: Shared RL training hyperparameters. Unless otherwise stated, the same settings are used for LEAF and GRPO.
### B.2 Dataset Specifics
To create the LibriSQA and CoVoST2 datasets, we follow the process described in the original works Wang et al. ([2020](https://arxiv.org/html/2606.07610#bib.bib24)); Zhao et al. ([2025](https://arxiv.org/html/2606.07610#bib.bib30)). To create the DailyTalk dataset, we use the prompts and data given in LongAudio Ghosh et al. ([2025](https://arxiv.org/html/2606.07610#bib.bib19)) and integrate them with the original data Lee et al. ([2023](https://arxiv.org/html/2606.07610#bib.bib22)), while filtering audio durations less than or equal to 40 seconds.

Table 6: Dataset splits used in this work.

Table 7: Number of RL training epochs. All other RL hyperparameters are shared as in Table [5](https://arxiv.org/html/2606.07610#A2.T5).
### B.3 Prompt templates
We list the prompt templates used for each dataset. The literal placeholder <\|audio\|\> is rewritten at tokenization time to the backbone-specific audio token: a single audio token for Granite Speech and <\|audio\_bos\|\><\|AUDIO\|\><\|audio\_eos\|\> for Qwen2-Audio.
#### LibriSQA\.
We use the dataset\-provided question field and wrap it with one of the following templates:
- Listen to the audio <\|audio\|\> and answer the following question: \{question\}
- <\|audio\|\> Please answer this question about the spoken passage: \{question\}
- You will hear speech in <\|audio\|\>\. Answer: \{question\}
#### CoVoST2 English\-to\-German\.
We use translation\-instruction templates such as:
- <\|audio\|\> Listen to the speech and translate it into German\.
- Translate the audio <\|audio\|\> from English to German\.
- You will hear English speech in <\|audio\|\>\. Provide the German translation\.
#### DailyTalk and LongAudio\.
For the LongAudio\-derived datasets, prompts are inherited from the dataset preparation pipeline\. They follow two structural variants:
- <\|audio\|\>\\n\{question\}
- \{question\}\\n<\|audio\|\>

where \{question\} is the dataset-provided open-ended question\.
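As a rough illustration of how the templates in this section could be instantiated, the sketch below fills the question and swaps the literal placeholder for a backbone-specific audio token. It works at the string level, whereas the text says the rewrite happens at tokenization time, so this is a simplification. The Qwen2-Audio string is taken from the description above; the exact Granite Speech token is not given in the text, so it is left as a caller-supplied argument, and `render_prompt` is a hypothetical helper name.

```python
PLACEHOLDER = "<|audio|>"
# Token sequence for Qwen2-Audio as stated in the text above.
QWEN2_AUDIO_TOKEN = "<|audio_bos|><|AUDIO|><|audio_eos|>"

def render_prompt(template, audio_token, question=None):
    """Replace the audio placeholder, then fill in the question if present."""
    # Rewrite the placeholder first so a question containing the literal
    # placeholder text is left untouched.
    prompt = template.replace(PLACEHOLDER, audio_token)
    if question is not None:
        # str.replace avoids str.format, which would choke on stray braces.
        prompt = prompt.replace("{question}", question)
    return prompt

covost_template = "Translate the audio <|audio|> from English to German."
print(render_prompt(covost_template, QWEN2_AUDIO_TOKEN))

# "<AUD>" is a stand-in, not the real Granite Speech token.
print(render_prompt("<|audio|>\n{question}", "<AUD>", "Who is speaking?"))
```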
### B.4 LLM-as-judge protocols
#### Prediction staging\.
For each dataset, backbone, method, and checkpoint, we generate a single stochastic prediction file using temperature $0.9$, top-$p = 0.9$, and maximum generation length $200$. All judge protocols consume the same staged prediction files, so judges are never given access to additional samples.
#### Grounded Likert\-5\.
We use Unbabel/M-Prometheus-14B as a rubric-grounded judge on a $1$–$5$ Likert scale. The judge receives the question, reference answer, candidate prediction, and a task-specific rubric. For QA tasks, the rubric emphasizes correctness, groundedness, completeness, and hallucination. For CoVoST2, the rubric emphasizes accuracy, fluency, and style. Decoding is greedy with temperature $0$.
#### GEMBA-DA 0–100.
For continuous per-item scoring, we use a GEMBA-style direct-assessment prompt with Qwen2.5-14B-Instruct. Each item is judged $N = 5$ times at temperature $0.7$, and we use the per-item mean score as the canonical DA-100 value.
#### Pairwise judging\.
For pairwise LEAF-vs.-GRPO comparisons, we use BAAI/JudgeLM-13B-v1.0 for QA tasks and M-Prometheus relative grading for CoVoST2. Each pair is evaluated in both candidate orders. A verdict is counted only when the two swapped evaluations agree; otherwise, the pair is marked inconsistent and excluded from the decided-pair win rate.
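The position-swap aggregation described above can be sketched as follows. The verdict encoding (`"A"`, `"B"`, `"tie"`) and the treatment of ties (an agreed tie counts as consistent but as a win for neither side) are assumptions made for illustration; the text does not specify how ties are handled.

```python
def aggregate_pairwise(verdicts):
    """Aggregate position-swapped verdicts.

    verdicts: list of (v_leaf_first, v_grpo_first), each in {"A", "B", "tie"},
    where "A"/"B" name the first/second candidate shown to the judge.
    """
    leaf_first = {"A": "LEAF", "B": "GRPO", "tie": "tie"}
    grpo_first = {"A": "GRPO", "B": "LEAF", "tie": "tie"}
    decided = []
    for v1, v2 in verdicts:
        w1, w2 = leaf_first[v1], grpo_first[v2]
        if w1 == w2:  # both orders agree on the winner
            decided.append(w1)
    n = len(verdicts)
    consist = len(decided) / n if n else 0.0
    leaf = decided.count("LEAF") / len(decided) if decided else 0.0
    grpo = decided.count("GRPO") / len(decided) if decided else 0.0
    return {"consist": consist, "leaf_win": leaf, "grpo_win": grpo,
            "net_adv": leaf - grpo}

# Four pairs: three order-consistent (two LEAF wins, one GRPO win), one inconsistent.
print(aggregate_pairwise([("A", "B"), ("A", "A"), ("B", "A"), ("A", "B")]))
```

In the toy input, the second pair prefers whichever candidate is shown first, so it is dropped as position-biased; the net advantage is then computed over the remaining three pairs only.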
## Appendix C Forkability: per-cell measurements
This appendix expands Section [6](https://arxiv.org/html/2606.07610#S6) with (i) the per-cell breakdown of the usable fork-fire rate, (ii) the auxiliary fork-group statistics that justify treating fires as discriminating branches rather than near-clones, (iii) the cumulative-position depth quantiles, and (iv) a backbone-organised view of the per-position curves.
### C.1 Per-cell usable fork-fire rate
Table [8](https://arxiv.org/html/2606.07610#A3.T8) gives the $12$-cell breakdown that underlies the pooled rows of Tab. [1](https://arxiv.org/html/2606.07610#S6.T1). The minimum cell rate is $72.9\%$ (Granite-3.3-8B on LongAudio) and the maximum is $94.6\%$ (Granite-4.0-1B on LibriSQA). Within every task the three backbone CIs overlap; between tasks, the LibriSQA / DailyTalk / CoVoST2 / LongAudio ordering recovers the task-monotonicity claim.
Table 8: Per-cell usable fork-fire rate $r_f$ with $95\%$ per-position bootstrap CIs ($n_{\text{boot}} = 5000$). $N$ is the number of validation+test examples measured per cell.
### C.2 Fork-group statistics
Table [9](https://arxiv.org/html/2606.07610#A3.T9) reports the mean number of rollouts sharing each fired prefix (“Group size”, maximum $K = 8$), the mean within-group BLEU spread (“Reward spread”), the usable-count weighted mean position of fired forks (“MU pos”), and the number of distinct token positions hosting at least one usable fork (“# pos”). Group sizes range from $4.0$ to $6.2$ across cells and reward spread is strictly positive in every cell ($\geq 0.027$), confirming that fired forks discriminate higher- from lower-reward branches rather than collapsing to duplicates.
Table 9: Auxiliary per-cell fork-group statistics.
### C.3 Depth profile of usable fork-fires
Table [10](https://arxiv.org/html/2606.07610#A3.T10) reports cumulative-position quantiles: $P_q$ is the token position by which $q\%$ of the cell’s usable fork-fires have already occurred. The depth profile clusters tightly across backbones within a task: the forkable window is governed by the task’s input–output coupling, not by the model. Short-output QA tasks concentrate fork-fires within the first $\approx 18$ tokens (DailyTalk: $P_{90} \leq 18$ across all three backbones), while long-form LongAudio maintains fork activity past position $35$ ($P_{90} = 35$–$39$, $P_{99} = 55$–$59$).
Table 10: Cumulative-position quantiles of usable fork-fires per cell. The depth profile is essentially constant across backbones within a task.
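The quantile $P_q$ can be computed from the list of token positions at which usable fork-fires were observed. The sketch below uses the inverse empirical CDF (the smallest position by which at least $q\%$ of fires have occurred); the paper does not state which quantile convention it uses, so this choice and the function name `depth_quantile` are assumptions.

```python
import numpy as np

def depth_quantile(positions, q):
    """Smallest position p such that at least q% of fires occur at positions <= p."""
    pos = np.sort(np.asarray(positions))
    # Number of fires that must be covered; epsilon guards against float round-up.
    k = max(1, int(np.ceil(q / 100 * len(pos) - 1e-9)))
    return int(pos[k - 1])

# Toy cell: one usable fork-fire at each of positions 1..10.
fires = list(range(1, 11))
print(depth_quantile(fires, 50), depth_quantile(fires, 90), depth_quantile(fires, 99))
```

With one fire at each of positions 1 through 10, half of the fires have occurred by position 5, nine tenths by position 9, and the 99% mark is only reached at the last position.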
### C.4 Per-backbone view of the per-position curves
Figure [3](https://arxiv.org/html/2606.07610#A3.F3) replots Fig. [2](https://arxiv.org/html/2606.07610#S6.F2) with the three backbones placed in separate panels and the four tasks overlaid in each, making the between-task separation visually direct. The four task curves order $\text{LibriSQA} > \text{DailyTalk} > \text{CoVoST2} > \text{LongAudio}$ at every position beyond the startup region, on every backbone, which is the visual analogue of the pooled-by-task ordering in Tab. [1](https://arxiv.org/html/2606.07610#S6.T1).
Figure 3: Per-position usable fork-fire rate, one panel per backbone, four tasks overlaid; shaded bands are pointwise $95\%$ Wilson CIs.
## Appendix D Additional Experimental Results
### D.1 Length Experiments
Table 11: Length efficiency: mean GPT-2 BPE token counts of the prediction (post-hoc, universal tokenizer; same for every row). $\Delta_{\text{base}} =$ LEAF $-$ base; $\Delta_{\text{GRPO}} =$ LEAF $-$ GRPO. Negative $=$ LEAF produces shorter outputs than the comparator. ‘–’ indicates the corresponding checkpoint was not trained / not staged.
## Appendix E Theoretical Proofs
This appendix gives the formal statements and proofs for the theoretical claims in the main text. Throughout, we fix a prompt $x$. Let $Y \sim \pi_\theta(\cdot \mid x)$ be a random completion and let $R = r(x, Y)$ be its terminal reward. The observed rewards $r_i$ are realizations of $R$ for sampled completions $y^{(i)}$. We assume $R \in [0, R_{\max}]$ when using concentration inequalities.
###### Assumption 1 (Fixed-prefix conditional sampling).
Fix a boundary $p$ and a prefix $u \in \mathcal{V}^p$. Conditional on $n(p,u) = m$, the rewards of the sampled descendants $\{r_i : y_{<p}^{(i)} = u\}$ are conditionally independent samples from the distribution of $R$ given $Y_{<p} = u$. We write $V(u) = \mathbb{E}[R \mid Y_{<p} = u]$.
###### Assumption 2 (Selected-node concentration).
For each retained node $v$, let $V_{\mathrm{sel}}(v)$ be its selection-conditional value. We assume that for all $\epsilon > 0$,

$$\mathbb{P}\left(|\widehat{V}(v) - V_{\mathrm{sel}}(v)| \geq \epsilon\right) \leq 2\exp\left(-\frac{2 n(v) \epsilon^2}{R_{\max}^2}\right).$$
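To get a feel for the scale of the bound in Assumption 2, the sketch below evaluates its right-hand side and inverts it to find how many descendants $n(v)$ are needed for a target failure probability. The function names and the example values of $\epsilon$ and $\delta$ are illustrative assumptions; only the formula comes from the assumption above.

```python
import math

def concentration_bound(n, eps, r_max):
    """Right-hand side of Assumption 2: 2 * exp(-2 * n * eps^2 / r_max^2)."""
    return 2.0 * math.exp(-2.0 * n * eps ** 2 / r_max ** 2)

def min_descendants(eps, delta, r_max):
    """Smallest integer n for which the bound is at most delta."""
    return math.ceil(r_max ** 2 * math.log(2.0 / delta) / (2.0 * eps ** 2))

# Rewards in [0, 1]; a node that retains all K = 8 rollouts as descendants.
print(concentration_bound(n=8, eps=0.5, r_max=1.0))
print(min_descendants(eps=0.5, delta=0.05, r_max=1.0))
```

Because the exponent scales with $n(v)\epsilon^2$, small fork groups only control coarse deviations: halving $\epsilon$ requires roughly four times as many descendants for the same failure probability.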
### E.1 Proof of the span-advantage claim
###### Theorem 4 (Validity of LEAF span advantages).
For a fixed prompt $x$, let $Y \sim \pi_\theta(\cdot \mid x)$, $R = r(x, Y)$, and $V(u) = \mathbb{E}[R \mid Y_{<|u|} = u]$. Consider a path of prefixes $U_0 \prec U_1 \prec \cdots \prec U_q$, with $U_0 = \varnothing$, and let $Z_j = \sum_{t=\tau_{j-1}}^{\tau_j - 1} \nabla_\theta \log \pi_\theta(A_t \mid S_t)$ be the score of span $j$. Define $\mathrm{GA}(U_j) = V(U_j) - V(U_0)$ and $\mathrm{LA}(U_j) = V(U_j) - V(U_{j-1})$. Then $\mathbb{E}[\mathrm{GA}(U_j) Z_j] = \mathbb{E}[\mathrm{LA}(U_j) Z_j] = \mathbb{E}[R Z_j]$. Moreover, if a retained node $v = (p, u)$ has $n(v) = m$ descendants, then under Assumption [1](https://arxiv.org/html/2606.07610#Thmassumption1), $\mathbb{E}[\widehat{V}(v) \mid n(v) = m] = V(u)$ and $\operatorname{Var}(\widehat{V}(v) \mid n(v) = m) = \operatorname{Var}(R \mid Y_{<p} = u)/m$.
###### Proof\.
Fix a span $j$ and a token $t\in[\tau_{j-1},\tau_{j})$. Since $U_{j}=Y_{<\tau_{j}}$ contains $A_{t}$ and its prefix state $S_{t}$, the score term is measurable with respect to $U_{j}$. By the tower property,

$$\mathbb{E}\left[R\,\nabla_{\theta}\log\pi_{\theta}(A_{t}\mid S_{t})\right]=\mathbb{E}\left[V(U_{j})\,\nabla_{\theta}\log\pi_{\theta}(A_{t}\mid S_{t})\right].\tag{2}$$

For any random variable $B$ that is measurable with respect to $S_{t}$,

$$\mathbb{E}\left[B\,\nabla_{\theta}\log\pi_{\theta}(A_{t}\mid S_{t})\right]=\mathbb{E}\left[B\,\mathbb{E}\left[\nabla_{\theta}\log\pi_{\theta}(A_{t}\mid S_{t})\mid S_{t}\right]\right]=0,\tag{3}$$

because $\mathbb{E}[\nabla_{\theta}\log\pi_{\theta}(A_{t}\mid S_{t})\mid S_{t}]=0$. Taking $B=V(U_{0})$ in ([3](https://arxiv.org/html/2606.07610#A5.E3)) and combining with ([2](https://arxiv.org/html/2606.07610#A5.E2)) gives the identity for $\mathrm{GA}(U_{j})$. Taking $B=V(U_{j-1})$, which is measurable with respect to $S_{t}$ for every $t\in[\tau_{j-1},\tau_{j})$, gives the identity for $\mathrm{LA}(U_{j})$. Summing over $t=\tau_{j-1},\ldots,\tau_{j}-1$ proves $\mathbb{E}[\mathrm{GA}(U_{j})Z_{j}]=\mathbb{E}[\mathrm{LA}(U_{j})Z_{j}]=\mathbb{E}[RZ_{j}]$.
For the empirical claim, condition on $n(v)=m$ for a fixed prefix node $v=(p,u)$. By Assumption [1](https://arxiv.org/html/2606.07610#Thmassumption1), the descendant rewards are conditionally independent samples from $R\mid Y_{<p}=u$. Their mean is $V(u)$ and their variance is $\operatorname{Var}(R\mid Y_{<p}=u)$. Since $\widehat{V}(v)=m^{-1}\sum_{i\in I(v)}r_{i}$, the conditional mean is $V(u)$ and the conditional variance is $\operatorname{Var}(R\mid Y_{<p}=u)/m$. ∎
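The span-advantage identity can be checked by exact enumeration on a small example. The sketch below is a sanity check under stated assumptions, not the authors' implementation: it assumes a tabular softmax policy over a binary vocabulary, responses of length 4, boundaries $\tau=(0,1,3)$, and an arbitrary toy reward. All function names are hypothetical.

```python
import itertools
import math
import random

T = 4                 # response length (toy assumption)
TAUS = [0, 1, 3]      # boundaries tau_0 < tau_1 < tau_2, giving q = 2 spans

def make_policy(seed=0):
    """Random logits for every prefix state of a binary-vocabulary policy."""
    rng = random.Random(seed)
    states = [s for t in range(T) for s in itertools.product((0, 1), repeat=t)]
    return {s: [rng.gauss(0, 1), rng.gauss(0, 1)] for s in states}

def token_probs(logits):
    top = max(logits)
    z = [math.exp(l - top) for l in logits]
    return [v / sum(z) for v in z]

def seq_prob(policy, y):
    p = 1.0
    for t in range(len(y)):
        p *= token_probs(policy[y[:t]])[y[t]]
    return p

def span_score(policy, y, lo, hi):
    """Z_j: sum over t in [lo, hi) of grad log pi(a_t | s_t) w.r.t. the logits."""
    grad = {}
    for t in range(lo, hi):
        probs = token_probs(policy[y[:t]])
        for a in (0, 1):
            grad[(y[:t], a)] = (1.0 if a == y[t] else 0.0) - probs[a]
    return grad

def value(policy, reward, u):
    """V(u) = E[R | Y_{<|u|} = u], by enumerating all continuations of u."""
    num = den = 0.0
    for tail in itertools.product((0, 1), repeat=T - len(u)):
        p = seq_prob(policy, u + tail)
        num += p * reward(u + tail)
        den += p
    return num / den

def expected_weighted_score(policy, j, weight):
    """E[weight(Y) * Z_j], computed exactly over all 2^T responses."""
    total = {}
    for y in itertools.product((0, 1), repeat=T):
        p, w = seq_prob(policy, y), weight(y)
        for key, g in span_score(policy, y, TAUS[j - 1], TAUS[j]).items():
            total[key] = total.get(key, 0.0) + p * w * g
    return total

def max_identity_gap(seed=0):
    """Largest coordinate-wise gap between E[R Z_j] and the GA / LA versions."""
    policy = make_policy(seed)
    reward = lambda y: float(sum(y) % 3 == 0) + 0.5 * y[1]   # arbitrary toy reward
    V = lambda u: value(policy, reward, u)
    gap = 0.0
    for j in (1, 2):
        lo, hi = TAUS[j - 1], TAUS[j]
        terminal = expected_weighted_score(policy, j, reward)
        ga = expected_weighted_score(policy, j, lambda y: V(y[:hi]) - V(()))
        la = expected_weighted_score(policy, j, lambda y: V(y[:hi]) - V(y[:lo]))
        for key in terminal:
            gap = max(gap, abs(ga[key] - terminal[key]), abs(la[key] - terminal[key]))
    return gap

print(max_identity_gap())
```

Under these assumptions the gap should vanish up to floating-point rounding for any seed, since the three expectations agree exactly in theory.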
### E.2 Proof of the fork-budget claim
###### Theorem 5 (Fork-budget support-resolution tradeoff).
Fix rollout budget $K$, fork budget $B\geq 1$, and retention threshold $m_{0}\geq 2$, with $K\geq m_{0}$. Let $\mathcal{N}_{\mathrm{fork}}^{(m_{0})}$ be the retained non-root prefix nodes obtained by applying at most $B$ selected boundaries to the $K$ sampled responses and keeping only prefix groups with at least $m_{0}$ descendants. Then $|\mathcal{N}_{\mathrm{fork}}^{(m_{0})}|\leq B\lfloor K/m_{0}\rfloor$. For a selected retained node $v=(p,u)$, define its selection-conditional value as $V_{\mathrm{sel}}(v)=\mathbb{E}[R_{J}\mid v\in\mathcal{N}_{\mathrm{fork}}^{(m_{0})}]$, where $J$ is drawn uniformly from the descendant set $I(v)$. Under Assumption [2](https://arxiv.org/html/2606.07610#Thmassumption2), if $n_{\min}=\min_{v\in\mathcal{N}_{\mathrm{fork}}^{(m_{0})}}n(v)$, then with probability at least $1-\delta$,

$$\max_{v\in\mathcal{N}_{\mathrm{fork}}^{(m_{0})}}|\widehat{V}(v)-V_{\mathrm{sel}}(v)|\leq R_{\max}\sqrt{\frac{\log\left(2B\lfloor K/m_{0}\rfloor/\delta\right)}{2n_{\min}}}.$$

If selection does not distort the descendant reward law beyond conditioning on the retained prefix, then $V_{\mathrm{sel}}(v)=V(u)$.
###### Proof\.
Fix a selected boundary $p$. Exact prefix matching partitions the $K$ sampled responses into disjoint groups. Since each retained group has size at least $m_{0}$, there are at most $\lfloor K/m_{0}\rfloor$ retained groups at boundary $p$. Since at most $B$ boundaries are selected,

$$|\mathcal{N}_{\mathrm{fork}}^{(m_{0})}|\leq B\left\lfloor\frac{K}{m_{0}}\right\rfloor.$$
If $\mathcal{N}_{\mathrm{fork}}^{(m_{0})}=\varnothing$, the high-probability statement is vacuous. Otherwise, let $n_{\min}=\min_{v}n(v)$ over retained nodes. By Assumption [2](https://arxiv.org/html/2606.07610#Thmassumption2), for any retained node $v$,

$$\mathbb{P}\left(|\widehat{V}(v)-V_{\mathrm{sel}}(v)|\geq\epsilon\right)\leq 2\exp\left(-\frac{2n(v)\epsilon^{2}}{R_{\max}^{2}}\right).$$

Since $n(v)\geq n_{\min}$, a union bound gives

$$\begin{aligned}\mathbb{P}\left(\max_{v}|\widehat{V}(v)-V_{\mathrm{sel}}(v)|\geq\epsilon\right)&\leq 2|\mathcal{N}_{\mathrm{fork}}^{(m_{0})}|\exp\left(-\frac{2n_{\min}\epsilon^{2}}{R_{\max}^{2}}\right)\\&\leq 2B\left\lfloor\frac{K}{m_{0}}\right\rfloor\exp\left(-\frac{2n_{\min}\epsilon^{2}}{R_{\max}^{2}}\right).\end{aligned}\tag{4}$$

Setting the last expression equal to $\delta$ and solving for $\epsilon$ proves the stated bound. Finally, if selection does not distort the descendant reward law beyond the retained prefix $u$, then $V_{\mathrm{sel}}(v)=\mathbb{E}[R\mid Y_{<p}=u]=V(u)$. ∎
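The counting argument and the resulting radius are easy to evaluate for concrete budgets. The following is a rough sketch, assuming responses are token lists, that responses shorter than a boundary are simply skipped at that boundary (an assumption of this sketch, not something the text specifies), and using hypothetical helper names.

```python
import math
from collections import Counter

def retained_nodes(responses, boundaries, m0):
    """Group responses by exact prefix at each boundary; keep groups of size >= m0."""
    nodes = {}
    for p in boundaries:
        # Assumption of this sketch: responses shorter than p are skipped.
        counts = Counter(tuple(y[:p]) for y in responses if len(y) >= p)
        for u, n in counts.items():
            if n >= m0:
                nodes[(p, u)] = n
    return nodes

def node_count_bound(K, B, m0):
    """Upper bound B * floor(K / m0) on the number of retained non-root nodes."""
    return B * (K // m0)

def concentration_radius(K, B, m0, n_min, delta, r_max=1.0):
    """Radius R_max * sqrt(log(2 B floor(K/m0) / delta) / (2 n_min)) from Theorem 5."""
    return r_max * math.sqrt(math.log(2 * B * (K // m0) / delta) / (2 * n_min))

responses = [[1, 2, 3], [1, 2, 4], [1, 5, 6], [1, 5, 6],
             [7, 8, 9], [7, 8, 9], [7, 0, 0], [2, 2, 2]]   # K = 8 toy rollouts
nodes = retained_nodes(responses, boundaries=[1, 2], m0=2)
print(len(nodes), "<=", node_count_bound(K=8, B=2, m0=2))
print(concentration_radius(K=8, B=2, m0=2, n_min=min(nodes.values()), delta=0.1))
```

With these toy numbers the smallest retained group has only two descendants, so the radius exceeds $R_{\max}=1$ and the guarantee is uninformative. Raising $m_{0}$ increases $n_{\min}$ and shrinks the node-count bound, which is the support-resolution tradeoff the theorem describes.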
### E.3 Proof of the prefix-matching claim
###### Theorem 6 (Collision-only prefix matching is shallow-biased).
Let $\mathcal{A}=\{p_{1}<\cdots<p_{M}\}$ be the admissible boundaries. For $p\in\mathcal{A}$, define $U_{p}=Y_{<p}$, $\kappa(p)=\sum_{u\in\mathcal{V}^{p}}\mathbb{P}(U_{p}=u)^{2}$, and $\eta(p)=\operatorname{Var}(\mathbb{E}[R\mid U_{p}])/\operatorname{Var}(R)$ when $\operatorname{Var}(R)>0$. Then $\kappa(p)$ is non-increasing in $p$. Hence any collision-only selector $p_{\mathrm{coll}}\in\arg\max_{p\in\mathcal{A}}\kappa(p)$ selects from the earliest plateau $\{p\in\mathcal{A}:\kappa(p)=\kappa(p_{1})\}$; if $\kappa(p_{1})>\kappa(p_{2})$, it uniquely selects $p_{1}$. Moreover, there exist $p_{0}<p_{\star}$ such that $\kappa(p_{0})>\kappa(p_{\star})>0$, while $\eta(p_{0})=0$ and $\eta(p_{\star})=1$.
###### Proof\.
We first prove that $\kappa(p)$ is non-increasing. The prefix partition at depth $p+1$ refines the prefix partition at depth $p$. Fix a length-$p$ prefix $u$, and let $q_{a}=\mathbb{P}(U_{p+1}=(u,a))$. Then $\mathbb{P}(U_{p}=u)=\sum_{a}q_{a}$, and

$$\sum_{a}q_{a}^{2}\leq\left(\sum_{a}q_{a}\right)^{2}=\mathbb{P}(U_{p}=u)^{2}.$$

Summing over all $u\in\mathcal{V}^{p}$ gives $\kappa(p+1)\leq\kappa(p)$. Thus, by induction, $p_{a}<p_{b}$ implies $\kappa(p_{a})\geq\kappa(p_{b})$.
It follows that any maximizer of $\kappa(p)$ over $\mathcal{A}=\{p_{1}<\cdots<p_{M}\}$ lies in the earliest plateau $\{p\in\mathcal{A}:\kappa(p)=\kappa(p_{1})\}$. If $\kappa(p_{1})>\kappa(p_{2})$, then no deeper admissible boundary can match $\kappa(p_{1})$, so $p_{1}$ is the unique maximizer.
It remains to show that high collision need not imply reward information. Consider two admissible boundaries $p_{0}<p_{\star}$. Suppose all responses share the same prefix at $p_{0}$. Then $U_{p_{0}}$ is deterministic, so $\kappa(p_{0})=1$. Since $\mathbb{E}[R\mid U_{p_{0}}]$ is constant, $\eta(p_{0})=0$.
At boundary $p_{\star}$, suppose there are two prefixes $u_{0},u_{1}$, each with probability $1/2$, and let $R=\mathbf{1}\{U_{p_{\star}}=u_{1}\}$. Then $\kappa(p_{\star})=(1/2)^{2}+(1/2)^{2}=1/2$, so $\kappa(p_{0})>\kappa(p_{\star})>0$. Also, $\mathbb{E}[R\mid U_{p_{\star}}=u_{0}]=0$ and $\mathbb{E}[R\mid U_{p_{\star}}=u_{1}]=1$, so $\mathbb{E}[R\mid U_{p_{\star}}]=R$. Hence $\operatorname{Var}(\mathbb{E}[R\mid U_{p_{\star}}])=\operatorname{Var}(R)$, and $\eta(p_{\star})=1$. Therefore collision-only selection can prefer a shallower boundary with larger collision but zero reward informativeness. ∎
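The counterexample in the proof can be reproduced numerically. The sketch below encodes the response distribution as a dictionary from sequences to (probability, reward) pairs, which is an illustrative representation chosen here; the sequences `"xa"` and `"xb"` and the function name are hypothetical. It takes $p_{0}=1$ and $p_{\star}=2$.

```python
from collections import defaultdict

def prefix_stats(dist, p):
    """Return (kappa(p), eta(p)) for dist = {sequence: (probability, reward)}.

    Assumes every listed sequence has positive probability and Var(R) > 0.
    """
    mass = defaultdict(float)          # P(U_p = u)
    reward_mass = defaultdict(float)   # E[R * 1{U_p = u}]
    for y, (prob, r) in dist.items():
        mass[y[:p]] += prob
        reward_mass[y[:p]] += prob * r
    mean_r = sum(prob * r for prob, r in dist.values())
    var_r = sum(prob * (r - mean_r) ** 2 for prob, r in dist.values())
    kappa = sum(m ** 2 for m in mass.values())
    # Variance of the conditional mean E[R | U_p] across prefix groups.
    var_cond = sum(m * (reward_mass[u] / m - mean_r) ** 2 for u, m in mass.items())
    return kappa, var_cond / var_r

# Shared first token "x"; the second token decides the reward.
dist = {"xa": (0.5, 0.0), "xb": (0.5, 1.0)}
admissible = [1, 2]
stats = {p: prefix_stats(dist, p) for p in admissible}

# Collision-only selection: maximize kappa, ignoring eta.
p_coll = max(admissible, key=lambda p: stats[p][0])
print(stats, p_coll)
```

Here the collision-only selector picks the shallow boundary, where the prefix carries no reward information, while the deeper boundary explains all of the reward variance.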
## Appendix F AI Usage
AI-based assistants were used for limited editing, wording suggestions, and debugging assistance during experiment development and manuscript preparation. All technical content, experimental design, analysis, and conclusions were verified and finalized by the authors.