Skip a Layer or Loop It? Learning Program-of-Layers in LLMs
Summary
This paper introduces Program-of-Layers (PoLar), a method that allows LLMs to dynamically skip or loop pretrained layers per input, improving accuracy and efficiency over fixed-depth inference.
Cached at: 06/08/26, 09:17 AM
# Skip a Layer or Loop It? Learning Program-of-Layers in LLMs
Source: [https://arxiv.org/html/2606.06574](https://arxiv.org/html/2606.06574)
###### Abstract
Large language models (LLMs) perform inference through a fixed-depth, fixed-order, non-recurrent execution of all layers. We reveal the wide existence of training-free, flexible, dynamic “program-of-layers (PoLar)”, where pretrained layers can be packed as modules and then skipped or looped to form a customized program for each input. For most inputs, substantially shorter program executions can achieve the same or better accuracy, while incorrect predictions of the original LLM can be corrected by alternative programs with fewer layers. These observations indicate that inference admits multiple valid latent computations beyond the standard forward pass. To efficiently achieve PoLar in practice, we propose a lightweight PoLar prediction network, which learns to generate execution programs that dynamically skip or repeat pretrained layers for each input. Experiments on mathematical reasoning benchmarks demonstrate that PoLar consistently improves accuracy over standard inference and prior dynamic-depth methods, often while executing fewer layers, and that these gains persist under out-of-distribution evaluation. Our results suggest that fixed-depth execution captures only a narrow subset of an LLM’s latent reasoning capacity.
Machine Learning, ICML
## 1 Introduction
Figure 1: Program-of-layers (PoLar) for two different inputs. The $D$ layers in a pretrained LLM define $D$ functions $\{f_0, \ldots, f_{D-1}\}$. Instead of calling them in a static fixed order from $f_0$ to $f_{D-1}$, the dynamic inference of PoLar executes an *input-specific program* $\pi = (i_1, \ldots, i_K)$ that calls the functions with layer *skipping* and *recurrence*. PoLar enables a training-free architecture of dynamic depth for different inputs, yielding diverse latent computations that cannot be fully covered by existing methods.

Generalist foundation models, e.g., LLMs and VLMs, uniformly deploy a static, pre-defined architecture to all inputs, despite their diversity and high variance in complexity and difficulty (Liu et al., [2020](https://arxiv.org/html/2606.06574#bib.bib3); Xin et al., [2020](https://arxiv.org/html/2606.06574#bib.bib4); Zhou et al., [2020](https://arxiv.org/html/2606.06574#bib.bib5); Liu et al., [2021a](https://arxiv.org/html/2606.06574#bib.bib9)). In contrast, conventional problem solving by programs can be more flexible and adaptive in algorithmic structure and complexity. For example, an experienced programmer can save steps and compute on easier tasks while knowing how to scale up the space/time complexity to address more challenging problems. However, these programs are specifically designed and optimized for every problem class, so they are not as general as LLMs. This raises the questions: Is it always optimal and efficient to apply the same architecture or “program”, i.e., a forward pass through all the layers in a fixed order, to different tasks? Can a generalist model further optimize the “program” applied to each input?
In this paper, we formulate layers in a pretrained LLM as a library of atomic functions that a program can call in arbitrary order, an arbitrary number of times. This formulation allows us to represent a dynamic model architecture during inference as a program-of-layers (PoLar) for each input, as illustrated in Figure [1](https://arxiv.org/html/2606.06574#S1.F1). As the first empirical study of its kind, we investigate PoLar beyond the standard forward pass using Monte-Carlo Tree Search (MCTS) and find that better (more accurate and/or shorter) programs almost always exist for the inputs evaluated. Unlike previous works on layer-skipping/recurrence, early exit, and looped transformers (Liu et al., [2020](https://arxiv.org/html/2606.06574#bib.bib3); Xin et al., [2020](https://arxiv.org/html/2606.06574#bib.bib4); Zhou et al., [2020](https://arxiv.org/html/2606.06574#bib.bib5); Fan et al., [2019](https://arxiv.org/html/2606.06574#bib.bib20), [2024](https://arxiv.org/html/2606.06574#bib.bib22); Yang et al., [2023](https://arxiv.org/html/2606.06574#bib.bib10)), which adopt only one operation (either skip or repeat) to produce architectures of dynamic depth, our empirical study of the MCTS-searched programs reveals that searching in a joint space of layer skip/repeat often discovers much better programs than those found in separate spaces. While most effective programs can be shorter than the default, increasing program complexity via skip/repeat operations can substantially improve the output quality, especially on more difficult tasks. In addition, most successful programs are predominantly composed of contiguous layer segments. These observations not only verify the broad existence of better PoLar programs without requiring any training, but also motivate a practical PoLar prediction method that avoids the expensive cost of MCTS in PoLar’s large search space.
In particular, we aim to replace search-based program discovery with a direct, inference-time mechanism for generating execution programs. Instead of enumerating or exploring execution paths for each input (Li et al., [2025](https://arxiv.org/html/2606.06574#bib.bib38)), our goal is to predict an input-specific program-of-layers that determines how pretrained layers are executed during inference. This shifts program selection from an online search problem to a single-shot prediction problem, enabling practical deployment of program-of-layers inference in LLMs.
To this end, we propose a PoLar algorithm that predicts execution programs over frozen pretrained layers at inference time. The predicted execution program specifies how pretrained layers are selectively skipped or recurrently applied and is executed once to produce the final output. This design offers several advantages. First, it makes program-of-layers inference computationally feasible by eliminating the need for expensive per-input search. Second, by jointly supporting layer skipping and recurrence within a unified execution framework, PoLar strictly generalizes prior dynamic-depth methods that are limited to a single form of execution control. Third, it enables flexible test-time computation scaling in fully frozen models, allowing inference to adapt to input difficulty while preserving model generality.
We evaluate PoLar on a range of mathematical reasoning benchmarks using multiple pretrained LLMs. Our results show that PoLar consistently improves accuracy over standard inference and prior dynamic-depth methods, often while executing fewer layers on average. Moreover, increasing the number of candidate execution programs yields strong test-time computation scaling, and execution programs learned on in-distribution data generalize effectively to out-of-distribution benchmarks across diverse domains.
## 2 Dynamic Inference as a Program-of-Layers (PoLar) in Large Language Models
Inference in pretrained LLMs is implemented as a fixed-depth, fixed-order forward pass: every input is processed by executing the same sequence of transformer layers. Yet inputs to LLMs vary dramatically in difficulty. Some are answered correctly with minimal reasoning, while others require complex, multi-step computation. This discrepancy raises a basic question: Is the standard forward pass sufficient for correct inference across diverse inputs?
One possibility is that this fixed computation is indeed sufficient for all cases. Another is that correct prediction requires input-dependent variation in computation. In this work, we investigate the latter possibility. Such variation can occur either in token space, through longer and more explicit chains of thought, or within the model’s hidden states, a form of computation we refer to as *latent reasoning*.
Conjecture: Inference as Program-of-Layers. If the layers of a pretrained LLM are defined as functions, then for each input there can exist multiple distinct program-of-layers executions (beyond the standard forward pass) that produce correct predictions.
Inference as the execution of a *program*. In this view, inference is a step-by-step procedure that selects and composes pretrained modules. The execution may vary across inputs in both length and order, while each module remains a fixed, pretrained function.
Consider a pretrained LLM with $D$ transformer layers, where each layer defines a fixed computation function

$$f_i : \mathbb{R}^{T \times d} \rightarrow \mathbb{R}^{T \times d}, \quad i \in \{0, \ldots, D-1\}.$$

A program is defined as a finite sequence of layer indices

$$\pi = (i_1, i_2, \ldots, i_K), \quad i_k \in \{0, \ldots, D-1\},$$

which induces the composed computation

$$F_\pi = f_{i_K} \circ \cdots \circ f_{i_1}.$$

Executing a program applies this composition to the input and produces a prediction. A program is considered *valid* if it yields a correct prediction for a given input.
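To make the composition concrete, the following is a minimal sketch of executing a program over a list of layer functions. It is an illustration only, not the paper's implementation: the scalar toy layers $f_i(h) = 2h + i$ and the names `execute_program` and `make_toy_layer` are assumptions introduced here, whereas real layers map $\mathbb{R}^{T \times d}$ to $\mathbb{R}^{T \times d}$.

```python
from typing import Callable, List, Sequence

# Toy stand-in for a layer f_i; a real layer maps R^{T x d} -> R^{T x d}.
Layer = Callable[[float], float]


def execute_program(layers: Sequence[Layer], program: Sequence[int], h: float) -> float:
    """Apply F_pi = f_{i_K} o ... o f_{i_1} to the state h, in program order."""
    for idx in program:
        if not 0 <= idx < len(layers):
            raise IndexError(f"layer index {idx} outside 0..{len(layers) - 1}")
        h = layers[idx](h)
    return h


def make_toy_layer(i: int) -> Layer:
    # Hypothetical layer f_i(h) = 2h + i, chosen only so that call order matters.
    return lambda h: 2 * h + i


layers: List[Layer] = [make_toy_layer(i) for i in range(4)]  # D = 4

standard_pass = [0, 1, 2, 3]  # the default forward pass f_0, ..., f_{D-1}
custom_program = [0, 2, 2]    # skip layers 1 and 3, loop layer 2 once

out_standard = execute_program(layers, standard_pass, 1)
out_custom = execute_program(layers, custom_program, 1)
print(out_standard, out_custom)
```

The standard forward pass is just the special case $\pi = (0, 1, \ldots, D-1)$; skipping removes indices from that sequence and recurrence repeats them.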
Figure 2: Sequential MCTS (left) vs. end-to-end PoLar network (right) for prediction of programs. (a) MCTS in the space of execution programs via sequential iterations of selection, expansion, simulation, and backpropagation. Each node represents a partial or complete execution program, and skip/repeat operations expand the search tree iteratively. This explicit and thorough search is expensive and impractical. (b) Our PoLar trains an end-to-end, lightweight prediction network that directly produces a program representation composed of (i) a binary mask $\mathbf{z}^{\text{seg}}(x)$ segmenting layers into modules, and (ii) a vector of operation labels $\mathbf{z}^{\text{op}}(x)$ that applies one operation out of *skip*, *keep*, or *repeat* to each module. Our method is scalable in practice and does not require sequential search.

Figure 3: Accuracy of MCTS-discovered programs under varying execution-depth budgets across five difficulty levels in DART-Math. We compare the original forward pass (orange) with 90–115% depth-budgeted programs (blue). Shaded regions denote the maximum gain achieved under the highest budget (115%).

Searching for valid execution programs. We explore the space of execution programs using MCTS. Execution programs are variable-length sequences over pretrained transformer layers, allowing both skipping and repetition. This space is large, discrete, and highly non-convex, making exhaustive search infeasible. MCTS provides a principled way to prioritize promising partial programs, enabling us to verify the existence of valid programs and analyze their structural properties. We use MCTS strictly as a diagnostic tool rather than a practical inference-time method; implementation details are given in Appendix [B](https://arxiv.org/html/2606.06574#A2).
All experiments are conducted on DART-Math (Tong et al., [2024](https://arxiv.org/html/2606.06574#bib.bib33)), a structured mathematical reasoning benchmark with five difficulty levels (DM-1 to DM-5). We evaluate four pretrained transformer models: LLaMA-3.2-3B-Instruct, Qwen1.5-MoE-A2.7B-Chat, Qwen2.5-3B-Instruct, and Qwen3-8B.
We summarize our empirical findings below.
Finding 1. Layer recurrence alone performs better than layer skipping alone, but combining the two complementary operators produces the best program-of-layers.
As shown in Table [1](https://arxiv.org/html/2606.06574#S2.T1), execution programs that allow layer recurrence (Loop) consistently outperform those that allow only layer skipping (Skip) across all evaluated models and difficulty levels. Moreover, combining skipping with recurrence (Skip&Loop) yields substantially larger gains than either operation alone, achieving the highest accuracy in every setting reported in Table [1](https://arxiv.org/html/2606.06574#S2.T1). These results indicate that recurrence provides a stronger mechanism for improving inference than skipping alone, while the two operations play complementary roles when combined.
Figure 4: Latent execution programs often admit shorter valid solutions. We compare the standard forward-pass depth with that of MCTS-discovered valid programs, for initially correct (C→C) and initially incorrect (W→C) inputs. Bars report total execution depth as a fraction of full model depth, with hatched overlays indicating effective depth (the number of unique layers).

Figure 5: (a) Test-time scaling via recurrence over layer segments. Allowing more latent execution steps through segment recurrence leads to a monotonic increase in the probability of discovering valid execution programs across models. (b) Recurrence and skipping are increasingly demanded for harder inputs. The fraction of inputs relying on layer recurrence or skipping to be solved increases with increasing difficulty for most models, except LLaMA-3.2-3B-Instruct, whose deviation is explained by effective difficulty in Table [1](https://arxiv.org/html/2606.06574#S2.T1).

Figure 6: Accuracy vs. total layer executions. For each model, we report how the average accuracy of valid execution programs changes with the total number of layer executions (% of base model depth). Across models and difficulty levels, accuracy increases with executed layers, revealing a consistent effect of depth-scaling.

Figure 7: Structural bias of valid execution programs. Valid programs rely primarily on contiguous layer segments as modules (a) and require at most one recurrence of each module (b).

Table 1: Accuracy (%) of programs searched for DART-Math (Diff 1–5) in different spaces: Base (standard forward), Skip (layer skipping), Loop (layer recurrence), and Skip&Loop (skipping + recurrence). Gain reports the absolute gain of Skip&Loop over Base. Bold indicates the best result, and underline indicates the second-best result in each row.

| Metric | Base | Skip | Loop | Skip&Loop | Gain |
|---|---|---|---|---|---|
| *LLaMA-3.2-3B-Instruct* | | | | | |
| DM-1 | 37.9 | 45.7 | 54.9 | **84.7** | +46.8 |
| DM-2 | 28.1 | 34.8 | 46.8 | **72.3** | +44.2 |
| DM-3 | 23.2 | 29.7 | 38.0 | **65.2** | +42.0 |
| DM-4 | 22.8 | 28.2 | 35.2 | **57.0** | +34.2 |
| DM-5 | 27.1 | 31.7 | 39.0 | **59.1** | +32.0 |
| *Qwen1.5-MoE-A2.7B-Chat* | | | | | |
| DM-1 | 37.4 | 42.7 | 57.7 | **73.2** | +35.8 |
| DM-2 | 26.4 | 32.6 | 43.7 | **59.7** | +33.3 |
| DM-3 | 20.0 | 23.8 | 34.4 | **50.4** | +30.4 |
| DM-4 | 13.9 | 16.4 | 25.1 | **38.2** | +24.3 |
| DM-5 | 10.9 | 13.3 | 20.4 | **32.4** | +21.5 |
| *Qwen2.5-3B-Instruct* | | | | | |
| DM-1 | 25.4 | 47.0 | 60.2 | **87.4** | +62.0 |
| DM-2 | 11.2 | 33.6 | 44.3 | **76.5** | +65.3 |
| DM-3 | 4.3 | 25.1 | 35.5 | **65.0** | +60.7 |
| DM-4 | 2.0 | 15.8 | 22.2 | **51.2** | +49.2 |
| DM-5 | 1.2 | 13.2 | 18.1 | **44.5** | +43.3 |
| *Qwen3-8B* | | | | | |
| DM-1 | 40.7 | 66.0 | 68.5 | **91.3** | +50.6 |
| DM-2 | 29.3 | 50.6 | 57.4 | **82.2** | +52.9 |
| DM-3 | 18.5 | 36.0 | 40.9 | **67.1** | +48.6 |
| DM-4 | 10.4 | 23.8 | 29.2 | **53.6** | +43.2 |
| DM-5 | 8.8 | 20.0 | 23.8 | **45.7** | +36.9 |
Finding 2 (Occam’s razor). Valid execution programs are often shorter than the standard forward pass.
As shown in Figure [3](https://arxiv.org/html/2606.06574#S2.F3), which reports the best accuracy obtained by MCTS under explicit budgets on total layer executions, many inputs remain solvable even when the overall computation is constrained to be shorter than the standard forward pass. Consistent with this trend, Figure [4](https://arxiv.org/html/2606.06574#S2.F4) shows that across models we frequently discover valid execution programs that require fewer layer applications than standard inference. In particular, among inputs already solved correctly by standard inference (C→C), 75.5% admit shorter valid programs. Even for inputs initially solved incorrectly (W→C), 36.2% admit shorter programs that correct the model’s prediction. These results indicate that standard inference often over-computes, and that correct inference can frequently be achieved with fewer latent computation steps.
Finding 3. Increasing latent execution complexity systematically expands the space of valid programs and improves inference on harder inputs.
While simple inputs often admit short execution programs (Finding 2), harder inputs demand greater test-time execution complexity. Across models and datasets, increasing execution depth and structural flexibility systematically expands valid program space and improves inference accuracy.
(a) Test-time scaling expands the space of valid execution programs for latent reasoning. As shown in Figure [5](https://arxiv.org/html/2606.06574#S2.F5)(a), allocating more test-time computation via recurrence monotonically increases the probability that a valid execution program is found, across models. This establishes test-time scaling in latent reasoning: greater computation yields a larger feasible program space and higher correctness.
(b) Harder inputs require more complex execution programs. Figure [5](https://arxiv.org/html/2606.06574#S2.F5)(b) shows that the fraction of solvable inputs that require recurrence and/or skipping generally increases with dataset difficulty for most models. As task difficulty grows, valid execution programs become more constrained and increasingly rely on non-trivial execution structures rather than standard inference. This indicates that higher latent execution complexity is not merely helpful, but often necessary for solving harder inputs. The different trend observed for LLaMA-3.2-3B-Instruct is explained by a mismatch between dataset-defined difficulty and the model’s effective difficulty (Table [1](https://arxiv.org/html/2606.06574#S2.T1)).
(c) Inference accuracy improves systematically with execution depth. As shown in Figure [6](https://arxiv.org/html/2606.06574#S2.F6), for all models and difficulty levels, the average accuracy of valid execution programs increases with total execution depth, measured relative to the original forward-pass depth. This reveals a consistent computation–accuracy trade-off: while many inputs admit short execution programs (Finding 2), harder inputs benefit from, and often require, deeper or recurrent execution to achieve correct inference.
Latent execution programs reveal a continuum of test-time inference behaviors that standard inference cannot access. Allocating more execution complexity enables harder inputs to be solved and yields higher accuracy.
Finding 4. Valid execution programs are predominantly composed of contiguous layer segments and typically require at most a single recurrence per segment.
As shown in Figure [7](https://arxiv.org/html/2606.06574#S2.F7), valid execution programs discovered on pretrained models exhibit a strong structural bias toward simplicity. A segment denotes a set of layers executed as a unit and need not be contiguous, while recurrence always corresponds to re-execution of the same segment. We therefore analyze segment structure by measuring the number of consecutive layers within each segment. Most valid programs are dominated by highly local segments and involve at most a single recurrence; long-range jumps and deep iterative reuse are rare. Figure [7](https://arxiv.org/html/2606.06574#S2.F7)(a) shows that 54.5% of segments consist of a single layer, and over two-thirds contain at most two consecutive layers, whereas segments with predominantly non-consecutive layers account for less than 3.2% of cases. Consistently, Figure [7](https://arxiv.org/html/2606.06574#S2.F7)(b) indicates that most segments are repeated at most once. Together, these results reveal an inherent limitation of pretrained models as execution-program generators: their training objectives favor short-range, local reuse over rich program composition and complex control flow.
These findings show that standard inference selects only one execution from a vast space of valid latent programs. While MCTS reveals this space, its reliance on sequential search over an exponentially large program space makes it impractical for inference. This motivates a different approach: rather than searching over programs at test time, we ask whether a lightweight model can directly predict execution programs. Figure [2](https://arxiv.org/html/2606.06574#S2.F2) contrasts the MCTS-based sequential search with our proposed direct program prediction approach. In the remainder of this work, we pursue this learning-based alternative, retaining the benefits of latent program selection uncovered by MCTS while eliminating sequential search.
## 3 Learning Program-of-Layers (PoLar) in Large Language Models
Building on our empirical analysis (Section [2](https://arxiv.org/html/2606.06574#S2)), we propose PoLar, a method for *programming* pretrained language models at inference time by predicting input-specific execution programs (Figure [2](https://arxiv.org/html/2606.06574#S2.F2)). PoLar dynamically segments and composes pretrained layers into reusable modules, enabling flexible computation without parameter updates.
### 3.1 Program Representation
We instantiate the function library using *packed modules*, which segment contiguous pretrained transformer layers into reusable computation units. For a pretrained model of depth $D$, an execution program specifies (i) a segmentation of layers into modules and (ii) an operation applied to each segment. Each execution program is represented by two discrete structures: a binary boundary mask encoding the segmentation, and an operation label vector specifying the segment-level operations.
Segmentation. We partition the $D$ layers of a pretrained model into contiguous segments

$$[0 = s_1, s_2),\ [s_2, s_3),\ \ldots,\ [s_M, s_{M+1} = D),$$

with each segment length bounded by $s_{j+1} - s_j \leq K_{\max}$. Segmentation is represented by a binary boundary mask

$$\mathbf{z}^{\text{seg}}(x) \in \{0, 1\}^{D},$$

where $\mathbf{z}^{\text{seg}}_i = 1$ indicates that layer index $i$ starts a new segment, and $\mathbf{z}^{\text{seg}}_i = 0$ otherwise.
We set $K_{\max} = 4$ based on empirical evidence. Finding 4 in Section [2](https://arxiv.org/html/2606.06574#S2) shows that valid execution programs are dominated by short, contiguous layer segments. Bounding the segment length therefore captures the dominant local execution structures while substantially reducing the complexity of the program space. Although this representation restricts the set of admissible programs, it preserves the most prevalent compositional patterns in practice and enables stable learning with strong empirical performance.
Operations. For each segment $[s_j, s_{j+1})$, the execution program assigns one of three operations $\{\textsf{skip}, \textsf{keep}, \textsf{repeat}\}$, which determines how the segment is executed:

$$\begin{aligned} \textsf{skip} &: \emptyset, \\ \textsf{keep} &: [s_j, \ldots, s_{j+1}-1], \\ \textsf{repeat} &: [s_j, \ldots, s_{j+1}-1,\ s_j, \ldots, s_{j+1}-1]. \end{aligned}$$

The skip operator omits a segment to reduce computation, while repeat applies a single additional pass. Operations are represented by a categorical label vector

$$\mathbf{z}^{\text{op}}(x) \in \{\textsf{skip}, \textsf{keep}, \textsf{repeat}\}^{D},$$

where $\mathbf{z}^{\text{op}}_i$ is defined only when $\mathbf{z}^{\text{seg}}_i = 1$ (i.e., at segment start positions); labels at all other positions are ignored.
This operator set is intentionally minimal and empirically grounded. Finding 4 in Section [2](https://arxiv.org/html/2606.06574#S2) shows that valid execution programs rarely require more than a single re-execution within a segment, and that skip and repeat account for the most effective execution patterns, offering strong performance–efficiency trade-offs. The keep operator preserves the original computation when no modification is needed. Although our implementation allows at most one additional execution through repeat, the representation is not fundamentally limited to a single recurrence. The operation vocabulary can be extended to $\{\textsf{repeat}\text{-}2, \ldots, \textsf{repeat}\text{-}k\}$ to support multiple recurrences per segment. We use a single-repeat operator because the MCTS traces in Section [2](https://arxiv.org/html/2606.06574#S2) show that effective programs rarely benefit from deeper repeated execution of the same segment. This choice keeps the prediction space tractable while covering the dominant valid programs.
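The segment-to-path rules above can be sketched as a small expansion routine that turns a boundary mask and an operation vector into the concrete list of executed layer indices. This is a hedged illustration rather than the authors' code: the function names `segments_from_mask` and `expand_program`, the use of `None` for ignored label positions, and the assumption that layer 0 always opens a segment are choices made here.

```python
from typing import List, Optional, Sequence, Tuple


def segments_from_mask(z_seg: Sequence[int]) -> List[Tuple[int, int]]:
    """Turn a binary boundary mask into half-open segments [s_j, s_{j+1})."""
    depth = len(z_seg)
    if depth == 0:
        return []
    starts = [i for i, bit in enumerate(z_seg) if bit == 1]
    if not starts or starts[0] != 0:
        starts = [0] + starts  # assumption: layer 0 always starts a segment
    bounds = starts + [depth]
    return [(bounds[j], bounds[j + 1]) for j in range(len(starts))]


def expand_program(
    z_seg: Sequence[int],
    z_op: Sequence[Optional[str]],
    k_max: int = 4,
) -> List[int]:
    """Expand (z_seg, z_op) into the executed sequence of layer indices."""
    path: List[int] = []
    for start, end in segments_from_mask(z_seg):
        if end - start > k_max:
            raise ValueError(f"segment [{start}, {end}) longer than K_max={k_max}")
        block = list(range(start, end))
        op = z_op[start]  # labels are read only at segment start positions
        if op == "skip":
            continue                # skip: contributes no layers
        elif op == "keep":
            path += block           # keep: one pass over the segment
        elif op == "repeat":
            path += block + block   # repeat: one additional pass
        else:
            raise ValueError(f"unknown operation {op!r} at layer {start}")
    return path


# D = 6 layers split into [0, 2), [2, 3), [3, 6)
z_seg = [1, 0, 1, 1, 0, 0]
z_op = ["keep", None, "skip", "repeat", None, None]
path = expand_program(z_seg, z_op)
print(path)
```

In this example the program keeps layers 0 and 1, skips layer 2, and runs layers 3 to 5 twice, so 8 layer executions replace the 6 of the standard forward pass.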
### 3.2 Program-of-Layers (PoLar) Prediction Network
We train a lightweight predictor to output logits for the program representation defined in Section [3.1](https://arxiv.org/html/2606.06574#S3.SS1).
Architecture. Given an input $x$, we first encode it using a frozen embedding model (Qwen3-Embedding-0.6B) into token-level representations $\mathbf{H} = E(x) \in \mathbb{R}^{T \times d_q}$, where $T$ is the token length and $d_q$ is the hidden size of the embedding model. We project token representations to a working dimension $d$: $\tilde{\mathbf{H}} = \mathbf{H}\mathbf{W}_h \in \mathbb{R}^{T \times d}$.
*Layer queries.* We associate each pretrained transformer layer index $i \in \{0, \ldots, D-1\}$ with a learnable embedding $\mathbf{e}_i \in \mathbb{R}^{d}$, and stack them as $\mathbf{E} \in \mathbb{R}^{D \times d}$. These embeddings act as layer-specific queries.
*Cross-attention.* We apply multi-head cross-attention with layer embeddings as queries and token embeddings as keys/values:

$$\mathbf{X} = \mathrm{MHA}(\mathbf{Q}, \mathbf{K}, \mathbf{V}), \quad \mathbf{Q} = \mathbf{E},\ \mathbf{K} = \tilde{\mathbf{H}},\ \mathbf{V} = \tilde{\mathbf{H}},$$

where padding tokens are masked using the input attention mask. The output $\mathbf{X} \in \mathbb{R}^{D \times d}$ provides an input-conditioned representation for each layer index.
*Cross-layer encoder.* To model dependencies across model depth, we apply a lightweight transformer encoder over the layer dimension:

$$\mathbf{X}' = \mathrm{Enc}_{\text{layer}}(\mathbf{X}) \in \mathbb{R}^{D \times d}.$$

This enables self-attention across layers, allowing decisions at each layer to depend on global depth context.
*Prediction heads.* Two linear heads produce logits for segmentation boundaries and operations:

$$\boldsymbol{\ell}^{\text{seg}} = \mathbf{X}'\mathbf{W}_{\text{seg}} + \mathbf{b}_{\text{seg}} \in \mathbb{R}^{D}, \quad \boldsymbol{\ell}^{\text{op}} = \mathbf{X}'\mathbf{W}_{\text{op}} + \mathbf{b}_{\text{op}} \in \mathbb{R}^{D \times 3}.$$
Supervision from Valid Execution Programs. We supervise training using valid execution programs collected offline via MCTS (Section [2](https://arxiv.org/html/2606.06574#S2)). Each program is deterministically parsed into the program representation, producing ground-truth segmentation and operation labels $\mathbf{z}^{\text{seg}}(x)$ and $\mathbf{z}^{\text{op}}(x)$ in the format defined in Section [3.1](https://arxiv.org/html/2606.06574#S3.SS1). When multiple valid programs are available for an input and at least one is shorter than the full model depth, we down-weight the loss of the full-depth execution. This choice follows Finding 2: it favors the shorter valid programs that often exist while still preserving supervision from the original computation.
Training Objective. We train the predictor to match the ground-truth execution program, specified by segmentation and operation labels $\big(\mathbf{z}^{\text{seg}*}(x), \mathbf{z}^{\text{op}*}(x)\big)$. Let $p^{\text{seg}}_i = \sigma(\ell^{\text{seg}}_i)$ and $\mathbf{p}^{\text{op}}_i = \mathrm{Softmax}(\boldsymbol{\ell}^{\text{op}}_i)$. Segmentation is supervised with binary cross-entropy over boundary indicators:

$$\mathcal{L}_{\text{seg}} = -\sum_{i=0}^{D-1} \Big[\mathbf{z}^{\text{seg}*}_i \log p^{\text{seg}}_i + (1 - \mathbf{z}^{\text{seg}*}_i) \log(1 - p^{\text{seg}}_i)\Big].$$

Operation prediction uses a masked cross-entropy applied only at segment start positions. With mask $m_i = \mathbf{z}^{\text{seg}*}_i$, we compute

$$\mathcal{L}_{\text{op}} = -\sum_{i=0}^{D-1} m_i \cdot \log \mathbf{p}^{\text{op}}_i\big[\mathbf{z}^{\text{op}*}_i\big].$$

The final objective is $\mathcal{L} = \mathcal{L}_{\text{seg}} + \mathcal{L}_{\text{op}}$.
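As a worked illustration of the objective, the sketch below evaluates $\mathcal{L}_{\text{seg}}$ and $\mathcal{L}_{\text{op}}$ for a single example using only the Python standard library. It is not the training code: the function name `polar_loss`, the ordering of the operation classes in `OPS`, and the use of a numerically stable logit-space form of binary cross-entropy are assumptions made here, and the down-weighting of full-depth programs described above is omitted.

```python
import math
from typing import List, Optional, Sequence, Tuple

OPS = ("skip", "keep", "repeat")  # assumed class order for the 3-way head


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one logit vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def bce_with_logit(logit: float, target: int) -> float:
    """-[z log sigma(l) + (1 - z) log(1 - sigma(l))], computed in logit space."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def polar_loss(
    seg_logits: Sequence[float],
    op_logits: Sequence[Sequence[float]],
    z_seg: Sequence[int],
    z_op: Sequence[Optional[str]],
) -> Tuple[float, float, float]:
    """Return (L, L_seg, L_op) for one example with D = len(seg_logits) layers."""
    l_seg = 0.0
    l_op = 0.0
    for i, logit in enumerate(seg_logits):
        l_seg += bce_with_logit(logit, z_seg[i])
        if z_seg[i] == 1:  # mask m_i: operation loss only at segment starts
            l_op -= log_softmax(op_logits[i])[OPS.index(z_op[i])]
    return l_seg + l_op, l_seg, l_op


# D = 2: layer 0 starts a segment labelled "keep"; layer 1 continues it.
seg_logits = [0.0, 0.0]
op_logits = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0]]  # row 1 is masked out
total, l_seg, l_op = polar_loss(seg_logits, op_logits, [1, 0], ["keep", None])
print(total, l_seg, l_op)
```

With all-zero logits the segmentation term is $2\log 2$ and the operation term is $\log 3$; the second row of `op_logits` has no effect because its mask entry is 0.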
Inference-Time Program Decoding. At inference time, execution programs are decoded in two stages. First, segment boundaries are determined deterministically by thresholding the predicted segmentation logits $\boldsymbol{\ell}^{\text{seg}}$. If any resulting segment exceeds the maximum length constraint $K_{\max}$, additional boundaries are inserted to enforce it, yielding segment start positions $\{s_j\}$. Conditioned on this segmentation, we compute operation log-probabilities at each segment start from the predicted logits:

$$\log p(o_j \mid x, s_j) = \log \mathrm{Softmax}\big(\boldsymbol{\ell}^{\text{op}}_{s_j}\big)[o_j].$$

Rather than selecting operations independently via local argmax, we apply a small beam search over segment-level operation choices to account for non-local interactions between segments and to ensure globally consistent execution programs. This search operates over a highly constrained space and produces a ranked set of candidate execution programs $\pi(x)$. Finally, each candidate program is mapped deterministically to a concrete executed program using the segment-to-path rules in Section [3.1](https://arxiv.org/html/2606.06574#S3.SS1).
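The two decoding stages can be sketched as follows. Several details are assumptions made for illustration, since the text does not fix them: boundaries are thresholded at logit 0 (probability 0.5), layer 0 is always treated as a segment start, over-long segments are cut greedily every $K_{\max}$ layers, and beams are scored only by the sum of per-segment log-probabilities. Under that scoring the beam simply returns the highest-probability operation sequences; any additional global-consistency criterion used in practice is not modelled here. The names `decode_boundaries` and `beam_search_ops` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

OPS = ("skip", "keep", "repeat")  # assumed class order for the 3-way head


def log_softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def decode_boundaries(
    seg_logits: Sequence[float], k_max: int = 4, threshold: float = 0.0
) -> List[int]:
    """Stage 1: threshold boundary logits, then enforce the K_max length limit."""
    starts: List[int] = []
    for i, logit in enumerate(seg_logits):
        is_start = i == 0 or logit > threshold
        # Insert an extra boundary once the open segment already holds k_max layers.
        if not is_start and i - starts[-1] >= k_max:
            is_start = True
        if is_start:
            starts.append(i)
    return starts


def beam_search_ops(
    op_logits: Sequence[Sequence[float]], starts: Sequence[int], beam_size: int = 3
) -> List[Tuple[float, List[str]]]:
    """Stage 2: rank operation sequences by summed log-probability."""
    beams: List[Tuple[float, List[str]]] = [(0.0, [])]
    for s in starts:
        logp = log_softmax(op_logits[s])
        candidates = [
            (score + logp[k], ops + [OPS[k]])
            for score, ops in beams
            for k in range(len(OPS))
        ]
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_size]
    return beams


# D = 6: only layer 0 is predicted as a boundary, so K_max = 4 forces one at layer 4.
seg_logits = [2.0, -1.0, -1.0, -1.0, -1.0, -1.0]
op_logits = [[0.0, 2.0, 1.0]] + [[0.0, 0.0, 0.0]] * 3 + [[3.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

starts = decode_boundaries(seg_logits)
beams = beam_search_ops(op_logits, starts)
for score, ops in beams:
    print(round(score, 3), ops)
```

Each returned operation sequence, paired with `starts`, can then be expanded into executed layer indices with the segment-to-path rules of Section 3.1.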
Table 2: Pass@k accuracy under different inference strategies applied to LLaMA-3.2-3B-Instruct. For difficulty levels DM-1 to DM-5, PoLar consistently outperforms standard inference and dynamic-depth baselines across $k$ values, where $\tau = 0$ denotes zero temperature.

| Method | p@ | DM-1 | DM-2 | DM-3 | DM-4 | DM-5 |
|---|---|---|---|---|---|---|
| *LLaMA-3.2-3B-Instruct* | | | | | | |
| Base ($\tau$=0) | 1 | 42.4 | 28.6 | 27.2 | 27.6 | 28.6 |
| Base (sampling) | 1 | 40.6 | 28.6 | 27.4 | 28.0 | 29.2 |
| | 2 | 44.0 | 35.0 | 29.6 | 30.4 | 32.8 |
| | 3 | 46.2 | 38.8 | 31.2 | 31.4 | 34.4 |
| | 4 | 47.0 | 41.4 | 32.2 | 32.2 | 34.8 |
| | 5 | 47.6 | 43.2 | 32.8 | 32.8 | 35.6 |
| ShortGPT | 1 | 3.2 | 2.6 | 4.6 | 3.0 | 3.8 |
| | 2 | 7.0 | 4.2 | 8.0 | 6.8 | 7.4 |
| | 3 | 10.6 | 8.8 | 9.8 | 9.4 | 10.8 |
| | 4 | 12.8 | 11.0 | 12.0 | 12.2 | 13.8 |
| | 5 | 16.4 | 14.6 | 14.0 | 15.2 | 17.0 |
| MindSkip | 1 | 6.8 | 6.4 | 6.4 | 4.2 | 6.2 |
| | 2 | 14.6 | 11.8 | 13.2 | 8.4 | 9.8 |
| | 3 | 22.8 | 18.8 | 18.8 | 12.8 | 15.4 |
| | 4 | 28.4 | 25.0 | 22.4 | 18.0 | 19.8 |
| | 5 | 35.2 | 30.0 | 27.6 | 22.2 | 21.8 |
| FlexiDepth | 1 | 9.0 | 10.0 | 7.2 | 3.4 | 2.6 |
| | 2 | 16.4 | 15.6 | 14.8 | 5.6 | 6.6 |
| | 3 | 20.0 | 21.4 | 19.8 | 7.2 | 9.4 |
| | 4 | 24.6 | 25.6 | 23.6 | 8.8 | 12.0 |
| | 5 | 28.6 | 29.2 | 26.4 | 10.0 | 13.8 |
| DR.LLM | 1 | 41.6 | 28.2 | 27.0 | 27.4 | 28.4 |
| | 2 | 46.2 | 32.8 | 31.8 | 29.4 | 31.8 |
| | 3 | 49.4 | 36.0 | 35.4 | 31.4 | 34.0 |
| | 4 | 50.8 | 38.8 | 36.8 | 32.8 | 35.0 |
| | 5 | 53.6 | 40.0 | 37.8 | 33.4 | 36.8 |
| PoLar | 1 | 46.2 | 30.2 | 28.2 | 28.8 | 30.2 |
| | 2 | 56.6 | 37.4 | 34.8 | 32.8 | 36.6 |
| | 3 | 62.8 | 42.8 | 39.4 | 35.6 | 40.2 |
| | 4 | 66.8 | 45.6 | 42.6 | 38.0 | 42.8 |
| | 5 | 68.4 | 48.0 | 46.0 | 40.4 | 45.8 |
| $\Delta$ vs. Base (sampling) | 5 | +20.8 | +4.8 | +13.2 | +7.6 | +10.2 |
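Table 3: Out-of-distribution (OOD) performance at pass@1. We report accuracy using Qwen1.5-MoE-A2.7B-Chat. Columns Math through Hist are MMLU-Pro subject subsets.

| Method | ASDiv | MAWPS | Math | Phys | Chem | Law | Eng | Other | Econ | Health | Psych | Bus | Bio | Phil | Hist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Base ($\tau=0$) | 59.1 | 41.7 | 13.9 | 15.6 | 13.8 | 16.6 | 15.1 | 22.8 | 31.0 | 26.8 | 30.7 | 18.4 | 34.7 | 22.8 | 22.6 |
| ShortGPT | 2.3 | 0.6 | 3.8 | 0.4 | 4.2 | 3.3 | 3.8 | 2.6 | 2.4 | 2.2 | 2.8 | 4.9 | 2.9 | 1.8 | 4.7 |
| MindSkip | 0.0 | 0.0 | 0.9 | 1.2 | 1.0 | 1.9 | 1.3 | 1.3 | 0.6 | 0.7 | 1.3 | 0.8 | 1.7 | 0.6 | 1.8 |
| FlexiDepth | 0.0 | 0.0 | 2.1 | 3.5 | 2.5 | 4.4 | 3.3 | 2.5 | 1.7 | 3.5 | 2.0 | 2.9 | 3.8 | 3.0 | 3.7 |
| DR.LLM | 59.1 | 41.3 | 14.6 | 17.4 | 13.2 | 19.8 | 16.0 | 20.8 | 31.8 | 27.0 | 32.2 | 17.5 | 33.5 | 22.8 | 21.3 |
| PoLar | 63.8 | 46.7 | 18.5 | 20.3 | 18.3 | 20.4 | 19.9 | 26.6 | 34.6 | 29.5 | 35.3 | 20.9 | 36.9 | 25.9 | 23.5 |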
## 4 Experiments
We evaluate PoLar across both in-distribution and out-of-distribution benchmarks to assess whether learning *latent execution programs* provides a practical and transferable alternative to search-based test-time computation.
### 4.1 Experimental Setup
Models. We evaluate PoLar on a diverse set of pretrained, instruction-tuned LLMs spanning different architectures and scales: LLaMA-3.2-3B-Instruct, Qwen1.5-MoE-A2.7B-Chat, Qwen2.5-3B-Instruct, and Qwen3-8B. All models are used in a fully frozen setting with no parameter updates.
Datasets. We use DART-Math (Tong et al., [2024](https://arxiv.org/html/2606.06574#bib.bib33)), a structured mathematical reasoning dataset with five difficulty levels (DM-1 to DM-5), as the in-distribution benchmark. For out-of-distribution (OOD) evaluation, we use ASDiv (Miao et al., [2020](https://arxiv.org/html/2606.06574#bib.bib34)) and MAWPS (Kadlčík et al., [2023](https://arxiv.org/html/2606.06574#bib.bib35)), which focus on arithmetic word problems, as well as subject subsets from MMLU-Pro (Wang et al., [2024](https://arxiv.org/html/2606.06574#bib.bib36)) spanning mathematics, natural sciences, social sciences, and humanities. These benchmarks differ substantially from DART-Math in both format and domain coverage.
In-distribution evaluation. For DART-Math, we adopt a difficulty-wise train/test split: each difficulty level is split independently, and models are trained and evaluated within the same difficulty distribution.
Out-of-distribution (OOD) evaluation. For OOD evaluation, PoLar is trained on the union of DART-Math training data across all difficulty levels and evaluated zero-shot. This setting directly tests whether PoLar learns transferable computation control strategies rather than heuristics specific to a dataset or difficulty level.
Metric. We report pass@$k$ accuracy, defined as the probability that at least one of the top-$k$ candidates produces a correct answer. For PoLar, the $k$ candidates correspond to the top-$k$ predicted execution programs selected via beam search. For sampling-based baselines, $k$ corresponds to the number of stochastic decoding samples. Unless otherwise stated, OOD results are reported using pass@1.
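As a minimal sketch of the metric as defined above, the function below computes empirical pass@$k$ from per-example correctness flags. The function name and the input layout (one ranked list of booleans per example, best candidate first) are assumptions for illustration, not part of the paper's evaluation code.

```python
from typing import Sequence


def pass_at_k(correct_by_rank: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of examples where at least one of the top-k candidates is correct.

    correct_by_rank[n][j] says whether the j-th ranked candidate for example n
    is correct. For PoLar the candidates would be beam-ranked programs; for a
    sampling baseline they would be the stochastic decoding samples.
    """
    if not correct_by_rank:
        raise ValueError("need at least one example")
    if k < 1:
        raise ValueError("k must be at least 1")
    hits = sum(any(row[:k]) for row in correct_by_rank)
    return hits / len(correct_by_rank)


# Toy data: three examples, two ranked candidates each.
flags = [[True, False], [False, True], [False, False]]
p1 = pass_at_k(flags, 1)  # only the first example is solved by its top candidate
p2 = pass_at_k(flags, 2)  # the second example is solved by its second candidate
```

By construction the value is non-decreasing in $k$, since enlarging the candidate set can only add solved examples.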
Baselines. We compare PoLar against standard inference and representative dynamic-computation methods. Base ($\tau=0$) uses greedy decoding with temperature $\tau=0$. Base (sampling) samples $k$ outputs using stochastic decoding with $\tau \in \{0.3, 0.7, 1.0\}$ and reports the best result across temperatures, increasing output diversity without altering internal execution. DR.LLM (Heakl et al., [2025](https://arxiv.org/html/2606.06574#bib.bib26)) learns layer-routing policies from execution paths and applies them at inference time. ShortGPT (Men et al., [2025](https://arxiv.org/html/2606.06574#bib.bib27)) statically prunes layers based on estimated importance, yielding a reduced-depth model. MindSkip (He et al., [2024](https://arxiv.org/html/2606.06574#bib.bib28)) and FlexiDepth (Luo et al., [2025](https://arxiv.org/html/2606.06574#bib.bib29)) learn router-based dynamic-depth policies, primarily optimized for inference efficiency. Several approaches, such as Mixture-of-Depths (Raposo et al., [2024](https://arxiv.org/html/2606.06574#bib.bib30)), LaCo (Yang et al., [2024](https://arxiv.org/html/2606.06574#bib.bib31)), and Mixture-of-Recursions (Bae et al., [2025](https://arxiv.org/html/2606.06574#bib.bib32)), require substantial additional training or architectural modification. In contrast, PoLar performs lightweight test-time program selection without modifying pretrained model parameters.
More dataset and training details are in Appendix [D](https://arxiv.org/html/2606.06574#A4).
### 4.2 Main Results
We evaluate in-distribution performance on DART-Math. Table [2](https://arxiv.org/html/2606.06574#S3.T2) reports pass@$k$ results using LLaMA-3.2-3B-Instruct, with complete results provided in Appendix [C.1](https://arxiv.org/html/2606.06574#A3.SS1).
Figure 8: Pass@k accuracy and unique depth on Llama-3.2-3B-Instruct. (a) reports pass@$k$ accuracy for Base ($\tau=0$) and PoLar. (b) illustrates how often PoLar generates solutions that use fewer unique layers than the original model depth.

Accuracy gains arise from improved latent execution within the frozen model. At pass@1, PoLar consistently outperforms Base (sampling) across all difficulty levels. For example, accuracy increases from 40.6% to 46.2% on DM-1. Since pass@1 evaluates a single decoded output, this gain reflects more effective latent execution selection rather than output-space diversity.
Exploring the execution-program space enables effective test-time scaling. Increasing the number of candidate execution programs ($k$) monotonically improves PoLar across all difficulty levels, evidencing strong test-time computation scaling. Figure [8](https://arxiv.org/html/2606.06574#S4.F8)(a) demonstrates monotonic pass@$k$ gains for PoLar, far exceeding the Base strategy. Crucially, Figure [8](https://arxiv.org/html/2606.06574#S4.F8)(b) shows that these gains often use fewer unique layers than a standard forward pass, indicating that improvements arise from better latent execution programs rather than increased depth. In contrast, Base (sampling) exhibits diminishing returns as $k$ increases. At pass@5, it achieves 47.6/43.2/32.8/32.8/35.6 across difficulty levels DM-1 to DM-5, while PoLar improves these to 68.4/48.0/46.0/40.4/45.8, a gain of up to 20.8 percentage points (on DM-1). These results show that structured execution-program exploration outperforms output sampling under a fixed computation graph.
Program-level execution exploration is more effective than local routing decisions. Existing dynamic-depth methods primarily make local, layer-wise routing decisions, which restrict inference to a limited execution space and often degrade accuracy in our setting. DR.LLM supports both layer skipping and repetition but operates at the individual-layer level, limiting global coordination across depth. In contrast, PoLar formulates inference as *program-level* exploration over execution programs defined on packed contiguous segments, enabling coordinated skip and repeat patterns across depth. This design directly reflects the execution structures uncovered by MCTS, while replacing expensive search with a lightweight, learned predictor.
PoLar incurs negligible inference overhead and reduces end-to-end latency. Beyond counting executed layers, we measure wall-clock latency on Qwen1.5-MoE-A2.7B-Chat with 24 layers. As shown in Table [4](https://arxiv.org/html/2606.06574#S4.T4), the encoder, predictor head, and beam search introduce a total additional overhead of only 3.05 ms, corresponding to 0.8% of a standard forward pass and approximately 0.23 LLM layers. This overhead is small compared with the latency reduction achieved by executing fewer layers. Consequently, PoLar reduces end-to-end latency while improving accuracy: it achieves 0.83$\times$ the base runtime on easier inputs and 0.95$\times$ on harder inputs. The learned predictor is also lightweight in parameter count: across all evaluated backbones, it contains approximately 2.1M parameters, corresponding to only 0.01%–0.06% of the base LLM size. Full parameter counts are provided in Appendix [D.3](https://arxiv.org/html/2606.06574#A4.SS3).
Table 4: Component-wise inference overhead and end-to-end latency of PoLar on Qwen1.5-MoE-A2.7B-Chat with 24 layers. The additional components introduce negligible overhead compared with a standard forward pass, while PoLar reduces end-to-end latency and improves accuracy.

Component-wise overhead

| Component | Latency (ms) | Equiv. layers | % full forward |
|---|---|---|---|
| One LLM layer | 13.23 | 1.00 | 3.5% |
| Predictor head | 0.99 | 0.07 | 0.3% |
| Beam search | 0.11 | 0.01 | 0.03% |
| Encoder | 1.95 | 0.15 | 0.5% |
| Total additional overhead | 3.05 | 0.23 | 0.8% |

End-to-end latency

| Method | Avg. layers | Latency (ms) | Rel. / Acc. gain |
|---|---|---|---|
| Base | 24.00 | 373.45 | 1.00$\times$ / – |
| PoLar (DM-1) | 23.30 | 311.41 | 0.83$\times$ / +5.8 |
| PoLar (DM-5) | 23.76 | 353.31 | 0.95$\times$ / +1.2 |
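The derived figures in Table 4 follow from the measured latencies by simple ratios. The short Python check below reproduces them; the variable names are illustrative, and every number is taken from the table.

```python
# Measured latencies in milliseconds, copied from Table 4.
layer_ms = 13.23   # one LLM layer
base_ms = 373.45   # standard forward pass (Base)
overhead_ms = {"predictor_head": 0.99, "beam_search": 0.11, "encoder": 1.95}

total_overhead_ms = sum(overhead_ms.values())       # additional cost of PoLar
equiv_layers = total_overhead_ms / layer_ms         # overhead in units of LLM layers
pct_of_forward = 100 * total_overhead_ms / base_ms  # overhead as % of a full forward pass

# Relative end-to-end runtime for the two reported settings.
rel_dm1 = 311.41 / base_ms
rel_dm5 = 353.31 / base_ms

print(f"overhead: {total_overhead_ms:.2f} ms, "
      f"{equiv_layers:.2f} layers, {pct_of_forward:.1f}% of forward")
print(f"relative runtime: DM-1 {rel_dm1:.2f}x, DM-5 {rel_dm5:.2f}x")
```

Rounded to the table's precision, the results match the reported 3.05 ms, 0.23 layers, 0.8%, 0.83$\times$, and 0.95$\times$.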
### 4.3 Out-of-Distribution Performance
We evaluate the OOD generalization of execution programs learned from in-distribution data. As shown in Table [3](https://arxiv.org/html/2606.06574#S3.T3), PoLar consistently outperforms standard inference on all OOD benchmarks using Qwen1.5-MoE-A2.7B-Chat, with full results reported in Appendix [C.2](https://arxiv.org/html/2606.06574#A3.SS2).
Execution programs learned on mathematical datasets transfer across domains. On arithmetic word problem benchmarks such as ASDiv and MAWPS, PoLar achieves clear improvements over the standard forward pass. More notably, on MMLU-Pro, PoLar improves accuracy across diverse subject areas. We conjecture that this cross-domain transfer comes from two complementary sources. First, the external input representation maps examples from different domains into a shared semantic space, allowing the small PoLar prediction head trained on mathematics to generalize beyond its training distribution. Second, the predicted programs are constrained to simple structural patterns, namely contiguous segments with limited recurrence, which encourages reusable computation strategies rather than benchmark-specific execution heuristics.
## 5 Related Works
Transformers process inputs through sequential layer stacks, making layer-level computation reduction a critical research direction. Early-exit and layer skipping methods (Liu et al., [2020](https://arxiv.org/html/2606.06574#bib.bib3); Xin et al., [2020](https://arxiv.org/html/2606.06574#bib.bib4); Zhou et al., [2020](https://arxiv.org/html/2606.06574#bib.bib5); Liu et al., [2021a](https://arxiv.org/html/2606.06574#bib.bib9)) dynamically terminate computation at intermediate layers using auxiliary classifiers and confidence metrics, allowing easy inputs to exit early. LayerSkip (Elhoushi et al., [2024](https://arxiv.org/html/2606.06574#bib.bib37)) shares classifiers across layers to reduce overhead. LayerDrop (Fan et al., [2019](https://arxiv.org/html/2606.06574#bib.bib20)) trains models so arbitrary layer subsets can be skipped during inference. ShortGPT (Men et al., [2025](https://arxiv.org/html/2606.06574#bib.bib27)) assesses layer importance based on input-output similarity and drops low-importance layers. LaCo (Yang et al., [2024](https://arxiv.org/html/2606.06574#bib.bib31)) merges layers using weight arithmetic. Recent work introduces learned routing for adaptive skipping: FlexiDepth (Luo et al., [2025](https://arxiv.org/html/2606.06574#bib.bib29)) and MindSkip (He et al., [2024](https://arxiv.org/html/2606.06574#bib.bib28)) attach lightweight routers to pretrained models for input-adaptive layer skipping.
In addition to skipping layers, another line of research explores layer reuse and recurrence. Universal Transformers (Dehghani et al., [2018](https://arxiv.org/html/2606.06574#bib.bib21)) apply self-attention blocks recurrently with halting mechanisms to adapt depth per token. Recent looped transformers (Fan et al., [2024](https://arxiv.org/html/2606.06574#bib.bib22); Yang et al., [2023](https://arxiv.org/html/2606.06574#bib.bib10)) repeatedly apply single blocks to achieve better length generalization on algorithmic tasks by adjusting loop counts during inference. The Inner Thinking Transformer (Chen et al., [2025](https://arxiv.org/html/2606.06574#bib.bib11)) interleaves adaptive loops with residual “thinking” connections and per-token routing, devoting extra computation to difficult tokens. While these approaches demonstrate the value of recurrence, they require architectural redesign and training from scratch.
Li et al. ([2025](https://arxiv.org/html/2606.06574#bib.bib38)) study test-time depth adaptation by using search to dynamically skip or repeat pretrained transformer layers without finetuning. Their work demonstrates that alternative execution paths can improve inference, but the method remains search-based and requires expensive per-input program discovery. In contrast, we use MCTS only as an offline diagnostic tool to characterize the structure of the program space, and then replace search with a learned predictor that generates execution programs in a single shot.
Following this direction, DR.LLM (Heakl et al., [2025](https://arxiv.org/html/2606.06574#bib.bib26)) learns routing policies from MCTS-generated supervision and supports both skipping and repeating layers. However, DR.LLM performs sequential layer-wise routing, where each decision is made locally during the forward pass and depends on intermediate hidden states. In contrast, PoLar predicts the entire execution program upfront, before executing the frozen LLM. This avoids interleaving routing with layer execution and enables more efficient inference. Moreover, DR.LLM is limited to single-layer recurrence, whereas PoLar operates on contiguous layer segments; for example, PoLar can represent multi-layer recurrent modules such as $4 \rightarrow 5 \rightarrow 4 \rightarrow 5$, which are outside the single-layer routing space. Thus, PoLar provides a more coordinated and expressive program space while preserving fully frozen base model parameters.
## 6 Conclusion
We show that inference in LLMs need not be limited to a fixed-depth forward pass. By viewing pretrained transformer layers as reusable functions, we uncover multiple valid execution programs for a single input, many of which are shorter than standard execution and can correct model errors. Motivated by this insight, we introduce PoLar, a lightweight framework that predicts input-dependent execution programs by selectively skipping or repeating contiguous layer segments at inference time, without modifying model parameters. Across models and both in-distribution and out-of-distribution benchmarks, PoLar consistently outperforms standard inference and prior dynamic-depth methods. These findings suggest that fixed-depth execution captures only a narrow subset of an LLM’s latent reasoning capacity. Enabling flexible, programmatic execution over pretrained layers reallocates computation at inference time, offering a simple and effective route to more expressive and efficient inference in foundation models.
## Impact Statement
This work adapts the internal computation of pretrained LLMs at inference time by dynamically skipping or repeating layer segments. Its main potential benefit is improved efficiency: PoLar can reduce unnecessary computation on easier inputs while allocating more latent computation to harder ones, lowering inference cost, latency, and energy use without retraining the base model. This may make capable LLMs more accessible to researchers and practitioners with limited compute, and may support more sustainable deployment of foundation models. As with other methods that improve LLM capability or efficiency, broader deployment may amplify both beneficial and harmful uses. Potential benefits include education, scientific reasoning, and software assistance, while potential misuse includes scalable generation of misleading or harmful content. These risks largely arise from the underlying pretrained models and their applications rather than from dynamic execution itself. Since PoLar changes the execution path in an input-dependent manner, future work may further study how such paths can be audited or interpreted. Our experiments focus on mathematical reasoning and related benchmarks. Before applying dynamic execution in high-stakes domains, future work should evaluate robustness, calibration, interpretability, and safety alongside accuracy and efficiency. We hope this work encourages more responsible test-time computation methods that improve model performance while making compute allocation more transparent and efficient.
## References
- J. Andreas, M. Rohrbach, T. Darrell, and D. Klein (2016) Neural module networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 39–48. Cited by: [Appendix A](https://arxiv.org/html/2606.06574#A1.SS0.SSS0.Px3.p2.1).
- S. Bae, Y. Kim, R. Bayat, S. Kim, J. Ha, T. Schuster, A. Fisch, H. Harutyunyan, Z. Ji, A. Courville, et al. (2025) Mixture-of-recursions: learning dynamic recursive depths for adaptive token-level computation. arXiv preprint arXiv:2507.10524. Cited by: [§4.1](https://arxiv.org/html/2606.06574#S4.SS1.p6.4).
- Y. Chen, J. Shang, Z. Zhang, Y. Xie, J. Sheng, T. Liu, S. Wang, Y. Sun, H. Wu, and H. Wang (2025) Inner thinking transformer: leveraging dynamic depth scaling to foster adaptive internal thinking. arXiv preprint arXiv:2502.13842. Cited by: [Appendix A](https://arxiv.org/html/2606.06574#A1.SS0.SSS0.Px2.p1.1), [§5](https://arxiv.org/html/2606.06574#S5.p2.1).
- M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and Ł. Kaiser (2018) Universal transformers. arXiv preprint arXiv:1807.03819. Cited by: [Appendix A](https://arxiv.org/html/2606.06574#A1.SS0.SSS0.Px2.p1.1), [§5](https://arxiv.org/html/2606.06574#S5.p2.1).
- M. Elhoushi, A. Shrivastava, D. Liskovich, B. Hosmer, B. Wasti, L. Lai, A. Mahmoud, B. Acun, S. Agarwal, A. Roman, et al. (2024) Layerskip: enabling early exit inference and self-speculative decoding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 12622–12642. Cited by: [§5](https://arxiv.org/html/2606.06574#S5.p1.1).
- C. Eyzaguirre, F. Del Rio, V. Araujo, and A. Soto (2022) DACT-bert: differentiable adaptive computation time for an efficient bert inference. In Proceedings of NLP Power! The First Workshop on Efficient Benchmarking in NLP, pp. 93–99. Cited by: [Appendix A](https://arxiv.org/html/2606.06574#A1.SS0.SSS0.Px1.p2.1).
- A. Fan, E. Grave, and A. Joulin (2019) Reducing transformer depth on demand with structured dropout. arXiv preprint arXiv:1909.11556. Cited by: [Appendix A](https://arxiv.org/html/2606.06574#A1.SS0.SSS0.Px1.p1.1), [§1](https://arxiv.org/html/2606.06574#S1.p2.1), [§5](https://arxiv.org/html/2606.06574#S5.p1.1).
- Y. Fan, Y. Du, K. Ramchandran, and K. Lee (2024) Looped transformers for length generalization. arXiv preprint arXiv:2409.15647. Cited by: [Appendix A](https://arxiv.org/html/2606.06574#A1.SS0.SSS0.Px2.p1.1), [§1](https://arxiv.org/html/2606.06574#S1.p2.1), [§5](https://arxiv.org/html/2606.06574#S5.p2.1).
- M. Gordon, K. Duh, and N. Andrews (2020) Compressing bert: studying the effects of weight pruning on transfer learning. In Proceedings of the 5th Workshop on Representation Learning for NLP, pp. 143–155. Cited by: [Appendix A](https://arxiv.org/html/2606.06574#A1.SS0.SSS0.Px1.p1.1).
- S. He, T. Ge, G. Sun, B. Tian, X. Wang, and D. Yu (2024) Router-tuning: a simple and effective approach for enabling dynamic-depth in transformers. arXiv preprint arXiv:2410.13184. Cited by: [§4.1](https://arxiv.org/html/2606.06574#S4.SS1.p6.4), [§5](https://arxiv.org/html/2606.06574#S5.p1.1).
- A. Heakl, M. Gubri, S. Khan, S. Yun, and S. J. Oh (2025) Dr. llm: dynamic layer routing in llms. arXiv preprint arXiv:2510.12773. Cited by: [§4.1](https://arxiv.org/html/2606.06574#S4.SS1.p6.4), [§5](https://arxiv.org/html/2606.06574#S5.p4.1).
- G. Jain, N. Hegde, A. Kusupati, A. Nagrani, S. Buch, P. Jain, A. Arnab, and S. Paul (2024) Mixture of nested experts: adaptive processing of visual tokens. Advances in Neural Information Processing Systems 37, pp. 58480–58497. Cited by: [Appendix A](https://arxiv.org/html/2606.06574#A1.SS0.SSS0.Px3.p1.1).
- M. Kadlčík, M. Štefánik, O. Sotolár, and V. Martinek (2023) Calc-x and calcformers: empowering arithmetical chain-of-thought through interaction with symbolic systems. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 12101–12108. Cited by: [§4.1](https://arxiv.org/html/2606.06574#S4.SS1.p2.1).
- Z. Li, Y. Li, and T. Zhou (2025) Skip a layer or loop it? test-time depth adaptation of pretrained llms. arXiv preprint arXiv:2507.07996. Cited by: [§1](https://arxiv.org/html/2606.06574#S1.p2.1), [§5](https://arxiv.org/html/2606.06574#S5.p3.1).
- W. Liu, P. Zhou, Z. Wang, Z. Zhao, H. Deng, and Q. Ju (2020) Fastbert: a self-distilling bert with adaptive inference time. In Proceedings of the 58th annual meeting of the association for computational linguistics, pp. 6035–6044. Cited by: [Appendix A](https://arxiv.org/html/2606.06574#A1.SS0.SSS0.Px1.p2.1), [§1](https://arxiv.org/html/2606.06574#S1.p1.1), [§1](https://arxiv.org/html/2606.06574#S1.p2.1), [§5](https://arxiv.org/html/2606.06574#S5.p1.1).
- Y. Liu, F. Meng, J. Zhou, Y. Chen, and J. Xu (2021a) Faster depth-adaptive transformers. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 13424–13432. Cited by: [Appendix A](https://arxiv.org/html/2606.06574#A1.SS0.SSS0.Px1.p2.1), [§1](https://arxiv.org/html/2606.06574#S1.p1.1), [§5](https://arxiv.org/html/2606.06574#S5.p1.1).
- Z. Liu, F. Li, G. Li, and J. Cheng (2021b) EBERT: efficient bert inference with dynamic structured pruning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 4814–4823. Cited by: [Appendix A](https://arxiv.org/html/2606.06574#A1.SS0.SSS0.Px1.p1.1).
- X. Luo, W. Wang, and X. Yan (2025) Adaptive layer-skipping in pre-trained llms. arXiv preprint arXiv:2503.23798. Cited by: [§4.1](https://arxiv.org/html/2606.06574#S4.SS1.p6.4), [§5](https://arxiv.org/html/2606.06574#S5.p1.1).
- X. Men, M. Xu, Q. Zhang, Q. Yuan, B. Wang, H. Lin, Y. Lu, X. Han, and W. Chen (2025) Shortgpt: layers in large language models are more redundant than you expect. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 20192–20204. Cited by: [§4.1](https://arxiv.org/html/2606.06574#S4.SS1.p6.4), [§5](https://arxiv.org/html/2606.06574#S5.p1.1).
- S. Miao, C. Liang, and K. Su (2020) A diverse corpus for evaluating and developing english math word problem solvers. In Proceedings of the 58th annual meeting of the Association for Computational Linguistics, pp. 975–984. Cited by: [§4.1](https://arxiv.org/html/2606.06574#S4.SS1.p2.1).
- D. Raposo, S. Ritter, B. Richards, T. Lillicrap, P. C. Humphreys, and A. Santoro (2024) Mixture-of-depths: dynamically allocating compute in transformer-based language models. arXiv preprint arXiv:2404.02258. Cited by: [§4.1](https://arxiv.org/html/2606.06574#S4.SS1.p6.4).
- S. Tang, Y. Wang, Z. Kong, T. Zhang, Y. Li, C. Ding, Y. Wang, Y. Liang, and D. Xu (2023) You need multiple exiting: dynamic early exiting for accelerating unified vision language model. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10781–10791. Cited by: [Appendix A](https://arxiv.org/html/2606.06574#A1.SS0.SSS0.Px1.p3.1).
- Y. Tong, X. Zhang, R. Wang, R. Wu, and J. He (2024) Dart-math: difficulty-aware rejection tuning for mathematical problem-solving. Advances in Neural Information Processing Systems 37, pp. 7821–7846. Cited by: [§2](https://arxiv.org/html/2606.06574#S2.p6.1), [§4.1](https://arxiv.org/html/2606.06574#S4.SS1.p2.1).
- Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, et al. (2024) Mmlu-pro: a more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems 37, pp. 95266–95290. Cited by: [§4.1](https://arxiv.org/html/2606.06574#S4.SS1.p2.1).
- Q. Wu, Z. Ke, Y. Zhou, X. Sun, and R. Ji (2024) Routing experts: learning to route dynamic experts in multi-modal large language models. arXiv preprint arXiv:2407.14093. Cited by: [Appendix A](https://arxiv.org/html/2606.06574#A1.SS0.SSS0.Px3.p1.1).
- J. Xin, R. Tang, J. Lee, Y. Yu, and J. Lin (2020) DeeBERT: dynamic early exiting for accelerating bert inference. arXiv preprint arXiv:2004.12993. Cited by: [Appendix A](https://arxiv.org/html/2606.06574#A1.SS0.SSS0.Px1.p2.1), [§1](https://arxiv.org/html/2606.06574#S1.p1.1), [§1](https://arxiv.org/html/2606.06574#S1.p2.1), [§5](https://arxiv.org/html/2606.06574#S5.p1.1).
- G. Xu, J. Hao, L. Shen, H. Hu, Y. Luo, H. Lin, and J. Shen (2023) Lgvit: dynamic early exiting for accelerating vision transformer. In Proceedings of the 31st ACM International Conference on Multimedia, pp. 9103–9114. Cited by: [Appendix A](https://arxiv.org/html/2606.06574#A1.SS0.SSS0.Px1.p3.1).
- L. Yang, K. Lee, R. Nowak, and D. Papailiopoulos (2023) Looped transformers are better at learning learning algorithms. arXiv preprint arXiv:2311.12424. Cited by: [Appendix A](https://arxiv.org/html/2606.06574#A1.SS0.SSS0.Px2.p1.1), [§1](https://arxiv.org/html/2606.06574#S1.p2.1), [§5](https://arxiv.org/html/2606.06574#S5.p2.1).
- Y. Yang, Z. Cao, and H. Zhao (2024) Laco: large language model pruning via layer collapse. arXiv preprint arXiv:2402.11187. Cited by: [§4.1](https://arxiv.org/html/2606.06574#S4.SS1.p6.4), [§5](https://arxiv.org/html/2606.06574#S5.p1.1).
- W. Zhou, C. Xu, T. Ge, J. McAuley, K. Xu, and F. Wei (2020) Bert loses patience: fast and robust inference with early exit. Advances in Neural Information Processing Systems 33, pp. 18330–18341. Cited by: [Appendix A](https://arxiv.org/html/2606.06574#A1.SS0.SSS0.Px1.p2.1), [§1](https://arxiv.org/html/2606.06574#S1.p1.1), [§1](https://arxiv.org/html/2606.06574#S1.p2.1), [§5](https://arxiv.org/html/2606.06574#S5.p1.1).
## Appendix A Related Work
#### Layer Pruning and Early\-Exit Neural Networks
Many works aim to accelerate large Transformers by statically pruning weights or dynamically halting computation. Static pruning typically removes redundant neurons, heads, or layers after training. For example, Liu et al. ([2021b](https://arxiv.org/html/2606.06574#bib.bib1)) demonstrate that a significant fraction of BERT’s attention heads can be dropped with negligible performance loss, and Gordon et al. ([2020](https://arxiv.org/html/2606.06574#bib.bib2)) investigate fine-grained weight pruning in BERT. Fan et al. ([2019](https://arxiv.org/html/2606.06574#bib.bib20)) introduce LayerDrop, a structured dropout technique that effectively trains models so arbitrary subsets of layers can be skipped during inference without requiring fine-tuning. These methods produce smaller models that trade computation for a small accuracy loss.
By contrast, early-exit or input-adaptive methods add auxiliary classifiers at intermediate layers so that “easy” inputs exit early. Notable examples include FastBERT (Liu et al., [2020](https://arxiv.org/html/2606.06574#bib.bib3)) and DeeBERT (Xin et al., [2020](https://arxiv.org/html/2606.06574#bib.bib4)), which insert classifiers after each block and use confidence or entropy metrics to decide when to stop. PABEE (Zhou et al., [2020](https://arxiv.org/html/2606.06574#bib.bib5)) employs a patience criterion to halt when predictions stabilize. DACT-BERT (Eyzaguirre et al., [2022](https://arxiv.org/html/2606.06574#bib.bib6)) adopts a differentiable Adaptive Computation Time mechanism to learn how many Transformer layers to run for each example. Liu et al. ([2021a](https://arxiv.org/html/2606.06574#bib.bib9)) estimate input “hardness” via mutual information or reconstruction error to pre-determine the number of Transformer layers to use.
These early-exit networks achieve significant speedups on NLP tasks by adaptively reducing depth per input. More recently, early-exit ideas have been extended to vision and multimodal Transformers. Xu et al. ([2023](https://arxiv.org/html/2606.06574#bib.bib7)) propose LGViT, which adds heterogeneous exit heads (local and global) to ViT so that vision transformers can terminate early with minimal feature loss. Tang et al. ([2023](https://arxiv.org/html/2606.06574#bib.bib8)) introduce MuE (“Multiple Exiting”), a strategy for unified vision-language models that dynamically skips layers in both encoder and decoder based on input similarity. These works demonstrate that later layers can be skipped to allow image and vision-language models to adapt computation per sample with minimal accuracy drop. Our work generalizes this approach by allowing skipping of arbitrary layers and enabling reuse of certain layers.
#### Looped Transformer and Recurrent Depth
Another line of research makes Transformer depth adaptive by looping or repeating layers. The Universal Transformer (Dehghani et al., [2018](https://arxiv.org/html/2606.06574#bib.bib21)) was an early example: it applies the same self-attention block recurrently and uses a halting mechanism to determine when each position is “done” (adapting depth per token). Building on these ideas, recent work explicitly introduces loops in model architectures. Fan et al. ([2024](https://arxiv.org/html/2606.06574#bib.bib22)) demonstrate that a Looped Transformer – a single Transformer block applied repeatedly – can achieve much better length generalization on algorithmic tasks by adjusting the number of loops during inference. Similarly, Yang et al. ([2023](https://arxiv.org/html/2606.06574#bib.bib10)) note that looped architectures excel at learning algorithms by explicitly incorporating iterative characteristics into the transformer architecture. More sophisticated variants like the Inner Thinking Transformer (Chen et al., [2025](https://arxiv.org/html/2606.06574#bib.bib11)) interleave adaptive loops with residual “thinking” connections and per-token routing, enabling the model to devote extra computation only to particularly difficult tokens. In summary, these approaches explore recurrent or elastic depth via explicit loops to tailor the number of applied layers to each input’s complexity. Unlike our approach, they require special architecture design and training from scratch, whereas our work focuses on pure test-time adaptation.
#### Dynamic Routing and Modular Inference
A third theme treats networks as collections of modules or experts with dynamically chosen pathways per sample. Mixture-of-Experts (MoE) Transformer layers are a well-known example: they maintain multiple sub-networks (“experts”) and route each token to a subset. Wu et al. ([2024](https://arxiv.org/html/2606.06574#bib.bib12)) introduce Routing Experts (RoE) for multimodal LLMs, retrofitting trained models into a mixture-of-experts style by learning input-dependent shortcut routes through layers, guided by sparsity regularizers. Jain et al. ([2024](https://arxiv.org/html/2606.06574#bib.bib14)) present Mixture of Nested Experts (MoNE): experts organized in a hierarchy of increasing capacity, where tokens are sent to smaller experts when sufficient. MoNE learns to prioritize easy tokens through low-cost experts and reserve full models for hard cases, halving inference compute on ImageNet/Video tasks.
These methods exemplify sample-wise routing: at inference time, the model conditionally activates different sub-modules or experts for each input. Similarly, neural module networks (Andreas et al., [2016](https://arxiv.org/html/2606.06574#bib.bib13)) assemble task-specific computation graphs from a library of modules. In modern LLMs/VLMs, these routing approaches – whether through gating experts, skipping layers, or assembling modules – form a spectrum of modular inference techniques that adapt the computation graph on a per-sample basis to balance cost and accuracy. Interestingly, our work suggests that transformer layers can function effectively as modules even without being specifically trained for that purpose.
## Appendix B Searching the Execution Program Space
This appendix provides full details of the execution\-program search procedure used to test the conjecture in Section [2](https://arxiv.org/html/2606.06574#S2)\. The search is used purely as a diagnostic tool to study the existence and structure of valid execution programs, rather than as a practical inference\-time method\.
### B\.1 Execution Program Space
We follow the formalization in the main text and represent inference as the execution of a variable\-length program that composes pretrained transformer layers\. Consider a pretrained LLM with $D$ transformer layers, where each layer defines a fixed computation function

$$f_i:\mathbb{R}^{T\times d}\rightarrow\mathbb{R}^{T\times d},\quad i\in\{0,\ldots,D-1\}.$$

An execution program is defined as a finite sequence of layer indices

$$\pi=(i_1,i_2,\ldots,i_K),\quad i_k\in\{0,\ldots,D-1\},$$

which induces the composed computation

$$F_\pi=f_{i_K}\circ\cdots\circ f_{i_1}.$$

Executing a program applies this composition to the input representation and produces a prediction\. A program is considered *valid* for a given input if it yields a correct prediction\.
Programs may be shorter than the standard forward pass through layer skipping, or longer through layer repetition\. Increasing program length corresponds to increasing the number of latent reasoning steps\.
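As a minimal sketch of the definition above, the snippet below executes a program by applying the indexed functions in order, so that $i_1$ runs first and $i_K$ last. The scalar "layers" are hypothetical stand-ins for pretrained transformer layers (real layers map $T\times d$ tensors), and `run_program` is an illustrative name, not part of the paper's code.

```python
from typing import Callable, Sequence

Layer = Callable[[float], float]

def run_program(layers: Sequence[Layer], program: Sequence[int], x: float) -> float:
    """Apply F_pi = f_{i_K} o ... o f_{i_1} to x."""
    h = x
    for i in program:          # i_1 is applied first, i_K last
        h = layers[i](h)
    return h

# Toy stand-ins for pretrained layers (assumption): layer i maps h -> 2*h + i.
layers = [lambda h, i=i: 2 * h + i for i in range(4)]

standard = run_program(layers, [0, 1, 2, 3], 1.0)     # fixed-depth forward pass
skipped = run_program(layers, [0, 3], 1.0)            # layers 1 and 2 skipped
looped = run_program(layers, [0, 1, 1, 2, 3], 1.0)    # layer 1 executed twice
```

The same frozen functions yield three different computations; only the index sequence changes.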
### B\.2 Search Space Constraints
The unconstrained space of programs grows exponentially with program length\. To make search tractable while preserving expressiveness, we restrict the action space to structured operations on contiguous subsequences of layer indices\. Specifically, we allow two classes of actions:
- Skip: remove a contiguous block of $k$ indices from the program;
- Repeat: duplicate a contiguous block of $k$ indices for $r$ repetitions\.

In all experiments, block size $k$ and repetition count $r$ are bounded by small constants \($k, r \leq 4$\)\. These constraints significantly reduce the branching factor while retaining the ability to realize layer skipping, recurrence, and emergent reordering patterns\.
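The two actions can be sketched as pure functions on a tuple of layer indices. This is an illustration under one assumption: "for $r$ repetitions" is read here as inserting $r$ extra copies of the block directly after it, which the text does not pin down. The function names and the `start` parameter are hypothetical.

```python
MAX_K = 4  # bound on block size from the text
MAX_R = 4  # bound on repetition count from the text

def skip(program: tuple, start: int, k: int) -> tuple:
    """Remove the contiguous block program[start:start+k]."""
    assert 1 <= k <= MAX_K and 0 <= start <= len(program) - k
    return program[:start] + program[start + k:]

def repeat(program: tuple, start: int, k: int, r: int) -> tuple:
    """Insert r extra copies of program[start:start+k] right after the block
    (assumed reading of 'r repetitions')."""
    assert 1 <= k <= MAX_K and 1 <= r <= MAX_R
    assert 0 <= start <= len(program) - k
    block = program[start:start + k]
    return program[:start + k] + block * r + program[start + k:]

base = tuple(range(8))                # standard forward pass for a toy depth of 8
shorter = skip(base, 2, 2)            # drops layers 2 and 3
longer = repeat(base, 2, 2, 1)        # runs layers 2, 3 and then 2, 3 again
```

Repeating a block of size two places layer 3 before the second call to layer 2, which is one way the reordering patterns mentioned above can emerge from only these two actions.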
### B\.3 Monte Carlo Tree Search Formulation
We formulate program discovery as a sequential decision process and employ Monte Carlo Tree Search \(MCTS\) to explore the constrained program space\.
#### State and Actions\.
Each MCTS node corresponds to a partial or complete execution program $\pi$\. Actions modify the current program by applying a valid skip or repeat operation, yielding a new program\.
#### Reward\.
For a completed program $\pi$ and input $x$ with ground\-truth answer $y$, we define a binary reward

$$r(\pi,x)=\mathbf{1}\{F_\pi(x)=y\},$$

where $F_\pi(x)$ denotes executing the composed computation induced by $\pi$\.
#### Tree Policy\.
Tree traversal is guided by a UCB\-style objective that balances exploitation, exploration, and program length regularization:

$$\mathrm{UCB}(\pi)=\frac{R(\pi)}{v(\pi)}+c\sqrt{\frac{\ln V}{v(\pi)}}-\lambda\frac{|\pi|}{D},$$

where $R(\pi)$ is the cumulative reward, $v(\pi)$ is the visit count, $V$ is the total number of simulations, $c$ is the exploration constant, and $\lambda$ penalizes long programs to encourage efficiency\.
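The score can be computed term by term as below. This is a sketch only: the default values of `c` and `lam` are placeholders, since this appendix does not report the constants used, and returning infinity for unvisited nodes is a common convention assumed here rather than something the text specifies.

```python
import math

def ucb_score(cum_reward: float, visits: int, total_sims: int,
              program_len: int, depth: int,
              c: float = 1.0, lam: float = 0.1) -> float:
    """UCB(pi) = R/v + c*sqrt(ln V / v) - lam*|pi|/D."""
    if visits == 0:
        return math.inf                       # assumption: try unvisited nodes first
    exploit = cum_reward / visits             # mean reward R(pi)/v(pi)
    explore = c * math.sqrt(math.log(total_sims) / visits)
    penalty = lam * program_len / depth       # length regularizer
    return exploit + explore - penalty

# With equal reward statistics, the shorter program receives the higher score.
short_score = ucb_score(3.0, 4, 100, program_len=10, depth=20)
long_score = ucb_score(3.0, 4, 100, program_len=30, depth=20)
```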
### B\.4 Search Algorithm
Algorithm [1](https://arxiv.org/html/2606.06574#alg1) summarizes the MCTS procedure\. We initialize the root node with the standard forward execution $\pi_0=(0,1,\ldots,D-1)$ and iteratively perform selection, expansion, simulation, and backpropagation\. After a fixed number of simulations, we collect all explored programs with nonzero visit counts and analyze their validity and structural properties\.
Algorithm 1: Monte Carlo Tree Search for Execution Programs

Input: $x$, number of simulations $N_{\text{sim}}$

1. Initialize root program $\pi_0=(0,1,\ldots,D-1)$
2. for $n=1$ to $N_{\text{sim}}$ do
3. Selection: traverse tree using UCB to reach a leaf program
4. Expansion: generate child programs via valid skip/repeat actions
5. Simulation: execute $F_\pi(x)$ and compute reward $r(\pi,x)$
6. Backpropagation: update visit counts and cumulative rewards
7. end for
8. return explored programs and associated statistics
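The loop in Algorithm 1 can be sketched on a toy problem as follows. Everything here beyond the four MCTS phases is an assumption made for illustration: the reward oracle replaces real LLM execution, the bounds are smaller than the $k, r \leq 4$ used in the paper, `MAX_LEN` is a hypothetical cap on program growth, repeat inserts $r$ extra copies, and a node is expanded only after it has been evaluated once.

```python
import math
import random

MAX_K, MAX_R = 2, 2   # toy bounds (the text uses k, r <= 4)
MAX_LEN = 12          # hypothetical cap so repeated programs cannot grow forever

class Node:
    def __init__(self, program):
        self.program = program   # tuple of layer indices
        self.visits = 0          # v(pi)
        self.reward = 0.0        # cumulative reward R(pi)
        self.children = []

def neighbours(program):
    """Programs reachable by one skip or one repeat of a contiguous block."""
    out, n = set(), len(program)
    for k in range(1, MAX_K + 1):
        for s in range(n - k + 1):
            block = program[s:s + k]
            if n - k >= 1:                                  # never emit the empty program
                out.add(program[:s] + program[s + k:])      # skip
            for r in range(1, MAX_R + 1):                   # r extra copies (assumed reading)
                if n + k * r <= MAX_LEN:
                    out.add(program[:s + k] + block * r + program[s + k:])
    return sorted(out)

def ucb(node, total, depth, c, lam):
    if node.visits == 0:
        return math.inf                                     # try unvisited children first
    return (node.reward / node.visits
            + c * math.sqrt(math.log(total) / node.visits)
            - lam * len(node.program) / depth)

def search(reward_fn, depth, n_sim, c=1.0, lam=0.1, seed=0):
    rng = random.Random(seed)
    root = Node(tuple(range(depth)))
    for total in range(1, n_sim + 1):
        node, path = root, [root]
        while node.children:                                # selection
            node = max(node.children, key=lambda ch: ucb(ch, total, depth, c, lam))
            path.append(node)
        if node.visits > 0:                                 # expansion
            node.children = [Node(p) for p in neighbours(node.program)]
            if node.children:
                node = rng.choice(node.children)
                path.append(node)
        r = reward_fn(node.program)                         # simulation
        for visited in path:                                # backpropagation
            visited.visits += 1
            visited.reward += r
    explored, stack = {}, [root]
    while stack:                                            # collect visited programs
        node = stack.pop()
        if node.visits > 0:
            explored[node.program] = reward_fn(node.program)
        stack.extend(node.children)
    return explored

# Toy oracle: "layer" i adds i to a counter and the label is 12,
# so the standard program (sum 15) is wrong while skipping layer 3 is valid.
explored = search(lambda p: float(sum(p) == 12), depth=6, n_sim=60)
valid = sorted(p for p, ok in explored.items() if ok)
```

In the toy run the standard program receives zero reward, yet a shorter program one skip away is valid, mirroring the kind of correction by alternative programs that the search is meant to diagnose.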
## Appendix C Experimental Results
This appendix provides additional experimental results that complement the main paper\. We report full quantitative comparisons for both in\-distribution and out\-of\-distribution evaluations across multiple pretrained LLMs\. Unless otherwise stated, all models are evaluated in a fully frozen setting, and PoLar only predicts execution programs at inference time\.
### C\.1 In\-Distribution Performance
We first report detailed in\-distribution results on DART\-Math, a structured mathematical reasoning benchmark with five difficulty levels \(DM\-1 to DM\-5\)\. Tables [5](https://arxiv.org/html/2606.06574#A3.T5), [6](https://arxiv.org/html/2606.06574#A3.T6), and [7](https://arxiv.org/html/2606.06574#A3.T7) present pass@k accuracy under different inference strategies for Qwen1\.5\-MoE\-A2\.7B\-Chat, Qwen2\.5\-3B\-Instruct, and Qwen3\-8B, respectively\.
Across all models and difficulty levels, PoLar consistently outperforms standard inference and prior dynamic\-depth baselines\. In particular, increasing p@$k$ leads to monotonic accuracy improvements for PoLar, demonstrating effective test\-time computation scaling through execution\-program exploration\. At p@5, PoLar achieves substantial absolute gains over Base \(sampling\), with improvements of \+22\.0 / \+18\.4 / \+14\.4 / \+10\.4 / \+11\.4 on Qwen1\.5\-MoE\-A2\.7B\-Chat, \+17\.6 / \+10\.4 / \+7\.8 / \+2\.2 / \+9\.8 on Qwen2\.5\-3B\-Instruct, and \+7\.2 / \+17\.0 / \+11\.8 / \+5\.4 / \+7\.2 on Qwen3\-8B for DM\-1 to DM\-5, respectively\.
Notably, methods that rely solely on layer skipping \(e\.g\., ShortGPT, MindSkip, FlexiDepth\) often suffer severe accuracy degradation, especially on harder difficulty levels\. In contrast, PoLar jointly supports layer skipping and recurrence, allowing it to retain or improve accuracy while exploring diverse execution programs\. Compared to DR\.LLM, which performs layer\-level routing, PoLar consistently achieves higher accuracy, particularly at larger p@$k$, indicating the benefit of structured, program\-level execution prediction\.
Table 5: Pass@k accuracy under different inference strategies applied to Qwen1\.5\-MoE\-A2\.7B\-Chat\. For difficulty levels DM\-1 to DM\-5, PoLar consistently outperforms the Base program across all $k$ values; $\tau=0$ denotes zero temperature\.

| Method | p@k | DM-1 | DM-2 | DM-3 | DM-4 | DM-5 |
| --- | --- | --- | --- | --- | --- | --- |
| Base ($\tau=0$) | 1 | 38.0 | 24.6 | 20.6 | 15.2 | 14.2 |
| Base (sampling) | 1 | 35.4 | 22.0 | 14.6 | 12.6 | 6.6 |
| | 2 | 36.4 | 24.2 | 15.8 | 13.6 | 8.2 |
| | 3 | 38.8 | 24.6 | 16.6 | 14.0 | 9.6 |
| | 4 | 39.6 | 25.0 | 17.8 | 14.8 | 11.2 |
| | 5 | 40.0 | 25.6 | 18.6 | 15.0 | 11.8 |
| ShortGPT | 1 | 19.6 | 12.2 | 7.2 | 5.4 | 3.4 |
| | 2 | 22.4 | 15.0 | 9.4 | 7.2 | 5.2 |
| | 3 | 25.6 | 16.6 | 10.6 | 8.6 | 6.2 |
| | 4 | 26.6 | 17.6 | 11.4 | 9.6 | 7.0 |
| | 5 | 27.2 | 18.6 | 12.4 | 10.2 | 7.6 |
| MindSkip | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| | 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| | 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| | 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| | 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| FlexiDepth | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| | 2 | 0.0 | 0.4 | 0.2 | 0.0 | 0.0 |
| | 3 | 0.0 | 0.8 | 0.2 | 0.0 | 0.0 |
| | 4 | 0.4 | 0.8 | 0.2 | 0.0 | 0.0 |
| | 5 | 0.4 | 1.0 | 0.2 | 0.0 | 0.0 |
| DR.LLM | 1 | 38.0 | 24.6 | 20.6 | 15.2 | 14.2 |
| | 2 | 43.8 | 29.8 | 23.2 | 18.2 | 18.0 |
| | 3 | 44.4 | 30.2 | 23.4 | 18.2 | 18.8 |
| | 4 | 44.4 | 30.4 | 23.8 | 18.2 | 20.0 |
| | 5 | 48.4 | 36.8 | 27.4 | 22.8 | 21.6 |
| PoLar | 1 | 43.8 | 26.2 | 22.0 | 17.2 | 15.4 |
| | 2 | 53.2 | 34.6 | 28.2 | 20.8 | 19.8 |
| | 3 | 58.0 | 39.0 | 29.0 | 22.4 | 21.6 |
| | 4 | 60.8 | 41.2 | 31.2 | 23.6 | 22.6 |
| | 5 | 62.0 | 44.0 | 33.0 | 25.4 | 23.2 |
| $\Delta$ vs. Base (sampling) | 5 | +22.0 | +18.4 | +14.4 | +10.4 | +11.4 |
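Table 6: Pass@k accuracy under different inference strategies applied to Qwen2\.5\-3B\-Instruct\. For difficulty levels DM\-1 to DM\-5, PoLar consistently outperforms the Base program across all $k$ values; $\tau=0$ denotes zero temperature\.

| Method | p@k | DM-1 | DM-2 | DM-3 | DM-4 | DM-5 |
| --- | --- | --- | --- | --- | --- | --- |
| Base ($\tau=0$) | 1 | 22.0 | 13.6 | 11.0 | 7.8 | 5.2 |
| Base (sampling) | 1 | 24.2 | 16.0 | 11.6 | 8.2 | 5.4 |
| | 2 | 32.6 | 22.8 | 16.0 | 12.2 | 8.6 |
| | 3 | 37.8 | 26.8 | 18.2 | 13.8 | 10.0 |
| | 4 | 40.6 | 29.2 | 19.2 | 15.2 | 10.6 |
| | 5 | 42.2 | 30.2 | 20.4 | 15.8 | 13.0 |
| ShortGPT | 1 | 0.4 | 0.0 | 0.0 | 0.0 | 0.0 |
| | 2 | 0.8 | 0.0 | 0.0 | 0.0 | 0.0 |
| | 3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| | 4 | 1.2 | 0.0 | 0.0 | 0.0 | 0.0 |
| | 5 | 1.4 | 0.0 | 0.0 | 0.0 | 0.0 |
| MindSkip | 1 | 0.6 | 0.4 | 0.0 | 0.4 | 0.0 |
| | 2 | 1.4 | 0.6 | 0.0 | 0.4 | 0.0 |
| | 3 | 1.4 | 0.8 | 0.0 | 0.4 | 0.0 |
| | 4 | 1.8 | 1.0 | 0.0 | 0.4 | 0.0 |
| | 5 | 1.8 | 1.4 | 0.0 | 0.4 | 0.0 |
| FlexiDepth | 1 | 30.0 | 8.8 | 3.2 | 1.0 | 1.2 |
| | 2 | 43.6 | 15.0 | 6.4 | 2.2 | 1.8 |
| | 3 | 49.6 | 20.0 | 8.6 | 2.8 | 3.0 |
| | 4 | 55.4 | 22.4 | 11.0 | 3.4 | 4.6 |
| | 5 | 60.0 | 24.0 | 12.0 | 3.8 | 5.0 |
| DR.LLM | 1 | 6.8 | 6.6 | 4.6 | 2.2 | 4.2 |
| | 2 | 14.2 | 9.8 | 7.8 | 3.6 | 7.2 |
| | 3 | 17.4 | 11.8 | 9.6 | 5.4 | 7.4 |
| | 4 | 20.8 | 13.0 | 11.2 | 7.0 | 9.4 |
| | 5 | 21.6 | 14.6 | 12.6 | 8.0 | 10.2 |
| PoLar | 1 | 44.4 | 24.4 | 20.2 | 11.8 | 12.8 |
| | 2 | 50.6 | 28.0 | 22.2 | 12.8 | 16.4 |
| | 3 | 54.2 | 31.4 | 25.0 | 14.8 | 19.0 |
| | 4 | 57.4 | 34.6 | 26.8 | 17.2 | 21.0 |
| | 5 | 59.8 | 40.6 | 28.2 | 18.0 | 22.8 |
| $\Delta$ vs. Base (sampling) | 5 | +17.6 | +10.4 | +7.8 | +2.2 | +9.8 |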
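Table 7: Pass@k accuracy under different inference strategies applied to Qwen3\-8B\. For difficulty levels DM\-1 to DM\-5, PoLar consistently outperforms the Base program across all $k$ values; $\tau=0$ denotes zero temperature\.

| Method | p@k | DM-1 | DM-2 | DM-3 | DM-4 | DM-5 |
| --- | --- | --- | --- | --- | --- | --- |
| Base ($\tau=0$) | 1 | 41.6 | 28.8 | 19.0 | 12.0 | 13.0 |
| Base (sampling) | 1 | 37.0 | 23.0 | 9.2 | 9.6 | 7.0 |
| | 2 | 42.0 | 25.6 | 11.6 | 10.6 | 8.8 |
| | 3 | 45.8 | 26.0 | 12.4 | 11.4 | 9.8 |
| | 4 | 47.0 | 26.6 | 12.6 | 12.8 | 10.2 |
| | 5 | 48.4 | 27.8 | 13.6 | 13.4 | 10.6 |
| ShortGPT | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| | 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| | 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| | 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| | 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| MindSkip | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| | 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| | 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| | 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| | 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| FlexiDepth | 1 | 49.6 | 29.4 | 8.8 | 1.8 | 1.0 |
| | 2 | 67.0 | 42.2 | 13.8 | 3.2 | 2.0 |
| | 3 | 75.8 | 50.6 | 16.4 | 4.2 | 2.8 |
| | 4 | 81.2 | 55.6 | 18.4 | 5.2 | 4.8 |
| | 5 | 84.8 | 59.2 | 19.4 | 6.8 | 6.6 |
| DR.LLM | 1 | 2.0 | 0.6 | 0.6 | 0.2 | 2.0 |
| | 2 | 2.4 | 0.6 | 0.8 | 0.2 | 2.4 |
| | 3 | 2.4 | 0.6 | 0.8 | 0.2 | 2.4 |
| | 4 | 2.4 | 0.6 | 0.8 | 0.2 | 2.8 |
| | 5 | 2.4 | 1.0 | 0.8 | 0.2 | 2.8 |
| PoLar | 1 | 43.2 | 30.4 | 20.8 | 13.5 | 14.9 |
| | 2 | 48.4 | 33.2 | 21.2 | 16.0 | 14.6 |
| | 3 | 50.8 | 41.0 | 22.2 | 18.0 | 17.2 |
| | 4 | 53.4 | 44.0 | 24.8 | 18.4 | 17.6 |
| | 5 | 55.6 | 44.8 | 25.4 | 18.8 | 17.8 |
| $\Delta$ vs. Base (sampling) | 5 | +7.2 | +17.0 | +11.8 | +5.4 | +7.2 |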
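For readers reproducing numbers of this kind, the sketch below shows one common way to compute an empirical pass@k: a problem counts as solved if any of its first $k$ attempts is correct. Whether the paper uses exactly this estimator is an assumption; the function name and the toy correctness flags are illustrative only.

```python
def pass_at_k(correct_flags, k):
    """Percentage of problems with at least one correct answer among the first k attempts.

    correct_flags[i][j] is True if attempt j on problem i was correct
    (for PoLar, an attempt would correspond to one predicted execution program).
    """
    solved = sum(any(flags[:k]) for flags in correct_flags)
    return 100.0 * solved / len(correct_flags)

# Toy outcomes for four problems with three attempts each (made-up values).
flags = [
    [False, True, False],
    [False, False, False],
    [True, False, False],
    [False, False, True],
]
curve = [pass_at_k(flags, k) for k in (1, 2, 3)]
```

Under this definition the score cannot decrease as $k$ grows, since every additional attempt can only add solved problems.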
### C\.2 Out\-of\-Distribution Generalization
We further evaluate the out\-of\-distribution \(OOD\) generalization of PoLar on benchmarks that differ substantially from DART\-Math in both format and domain\. Tables [8](https://arxiv.org/html/2606.06574#A3.T8), [9](https://arxiv.org/html/2606.06574#A3.T9), and [10](https://arxiv.org/html/2606.06574#A3.T10) report pass@1 accuracy on ASDiv, MAWPS, and subject\-wise subsets of MMLU\-Pro using LLaMA\-3\.2\-3B\-Instruct, Qwen2\.5\-3B\-Instruct, and Qwen3\-8B, respectively\.
Across all evaluated models, PoLar consistently improves over standard inference on arithmetic word problem benchmarks \(ASDiv and MAWPS\), indicating strong transfer from structured mathematical reasoning to natural language problem settings\. On MMLU\-Pro, which spans diverse domains including mathematics, natural sciences, social sciences, and humanities, PoLar achieves broad and consistent gains across most subject areas\.
These results suggest that the execution programs learned by PoLar capture general, transferable computation control strategies rather than dataset\-specific heuristics\. Despite being trained on mathematical reasoning data, PoLar generalizes effectively to heterogeneous domains, highlighting the robustness of program\-of\-layers inference and its applicability beyond the original training distribution\.
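Table 8: Out\-of\-distribution \(OOD\) performance at pass@1\. We report accuracy on OOD benchmarks using LLaMA\-3\.2\-3B\-Instruct\. Columns Math through Hist are MMLU\-Pro subjects\.

| Method | ASDiv | MAWPS | Math | Phys | Chem | Law | Eng | Other | Econ | Health | Psych | Bus | Bio | Phil | Hist |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Base ($\tau=0$) | 78.4 | 71.5 | 19.5 | 19.0 | 18.6 | 17.1 | 19.8 | 22.4 | 36.1 | 28.5 | 33.0 | 22.2 | 44.9 | 23.8 | 28.3 |
| ShortGPT | 2.3 | 4.6 | 29.9 | 26.9 | 26.9 | 23.3 | 20.1 | 33.0 | 36.7 | 35.3 | 38.2 | 28.6 | 36.0 | 41.7 | 40.9 |
| MindSkip | 9.0 | 7.5 | 40.4 | 47.7 | 43.3 | 31.8 | 43.7 | 38.5 | 47.6 | 47.2 | 51.5 | 31.7 | 52.3 | 49.7 | 55.6 |
| FlexiDepth | 4.7 | 1.3 | 7.8 | 5.9 | 4.8 | 6.4 | 6.3 | 4.7 | 5.9 | 5.9 | 3.6 | 6.0 | 4.9 | 7.4 | 4.7 |
| DR.LLM | 75.4 | 61.3 | 13.6 | 14.4 | 16.2 | 10.0 | 16.4 | 9.0 | 13.6 | 10.6 | 15.2 | 13.0 | 19.8 | 8.2 | 23.1 |
| PoLar | 81.4 | 73.7 | 40.1 | 39.0 | 41.0 | 34.3 | 43.2 | 42.9 | 50.1 | 52.4 | 52.4 | 32.7 | 56.6 | 44.5 | 48.8 |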
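Table 9: Out\-of\-distribution \(OOD\) performance at pass@1\. We report accuracy on OOD benchmarks using Qwen2\.5\-3B\-Instruct\. Columns Math through Hist are MMLU\-Pro subjects\.

| Method | ASDiv | MAWPS | Math | Phys | Chem | Law | Eng | Other | Econ | Health | Psych | Bus | Bio | Phil | Hist |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Base ($\tau=0$) | 49.5 | 36.2 | 26.3 | 28.7 | 27.7 | 24.4 | 37.0 | 33.9 | 47.5 | 40.3 | 52.6 | 31.3 | 62.2 | 33.7 | 34.6 |
| ShortGPT | 1.3 | 1.2 | 8.9 | 9.3 | 10.0 | 12.1 | 12.1 | 12.8 | 12.6 | 13.4 | 8.9 | 11.5 | 10.3 | 9.8 | 12.1 |
| MindSkip | 8.0 | 3.5 | 7.6 | 7.5 | 8.3 | 6.0 | 9.5 | 7.4 | 8.6 | 9.8 | 7.5 | 8.5 | 7.5 | 9.4 | 6.8 |
| FlexiDepth | 2.7 | 0.6 | 9.8 | 9.3 | 10.5 | 10.4 | 11.4 | 11.5 | 10.8 | 11.5 | 9.9 | 8.2 | 10.6 | 11.4 | 11.0 |
| DR.LLM | 0.0 | 0.0 | 4.0 | 4.8 | 2.8 | 3.2 | 3.0 | 3.2 | 2.8 | 2.6 | 1.4 | 4.6 | 5.8 | 3.0 | 4.2 |
| PoLar | 78.1 | 57.7 | 32.1 | 35.6 | 33.0 | 27.6 | 41.5 | 38.9 | 53.0 | 46.5 | 57.4 | 33.4 | 66.0 | 38.9 | 35.4 |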
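Table 10: Out\-of\-distribution \(OOD\) performance at pass@1\. We report accuracy on OOD benchmarks using Qwen3\-8B\. Columns Math through Hist are MMLU\-Pro subjects\.

| Method | ASDiv | MAWPS | Math | Phys | Chem | Law | Eng | Other | Econ | Health | Psych | Bus | Bio | Phil | Hist |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Base ($\tau=0$) | 67.1 | 52.1 | 21.4 | 24.8 | 19.2 | 32.6 | 23.4 | 43.0 | 60.4 | 56.2 | 62.8 | 26.6 | 72.2 | 42.3 | 50.1 |
| ShortGPT | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| MindSkip | 20.9 | 13.1 | 26.1 | 33.9 | 31.5 | 58.0 | 32.1 | 67.0 | 64.7 | 68.0 | 71.9 | 36.6 | 68.8 | 68.5 | 65.9 |
| FlexiDepth | 1.0 | 0.4 | 0.9 | 1.0 | 0.6 | 0.6 | 0.5 | 0.8 | 0.9 | 1.0 | 0.1 | 0.8 | 1.0 | 0.6 | 1.3 |
| DR.LLM | 69.4 | 0.0 | 0.6 | 0.2 | 1.8 | 1.8 | 0.8 | 0.6 | 1.2 | 1.4 | 2.0 | 1.2 | 0.8 | 0.6 | 0.8 |
| PoLar | 72.8 | 66.9 | 26.8 | 34.6 | 23.4 | 48.7 | 35.8 | 57.1 | 65.2 | 67.4 | 72.4 | 34.1 | 75.4 | 56.7 | 62.0 |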
## Appendix D Empirical Details
### D\.1 Dataset Details
All in\-distribution experiments are conducted on DART\-Math, a structured mathematical reasoning benchmark consisting of five difficulty levels, denoted as DM\-1 to DM\-5\.
Each difficulty level contains 2,000 problem instances, resulting in 10,000 examples in total across all difficulty levels\. We adopt a difficulty\-wise data split, where each difficulty level is split independently into training, validation, and test sets\. Specifically, for each difficulty level, we use:
- 1,250 examples for training,
- 250 examples for validation, and
- 500 examples for testing\.
Overall, this results in 6,250 training examples, 1,250 validation examples, and 2,500 test examples\. All in\-distribution results are reported on the held\-out test sets corresponding to the same difficulty level used for training\.
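The split arithmetic above can be checked with a short sketch. The per-level sizes come from the text; the shuffling step, the seed, and the placeholder problem IDs are assumptions, since the paper does not state how examples were assigned to each split.

```python
import random

SPLIT_SIZES = {"train": 1250, "val": 250, "test": 500}

def split_level(examples, seed=0):
    """Split one difficulty level (2,000 items) into train/val/test."""
    assert len(examples) == sum(SPLIT_SIZES.values())
    items = list(examples)
    random.Random(seed).shuffle(items)        # shuffling is an assumption
    out, start = {}, 0
    for name, size in SPLIT_SIZES.items():
        out[name] = items[start:start + size]
        start += size
    return out

# Placeholder IDs stand in for the real DART-Math problems.
levels = {f"DM-{d}": [f"DM-{d}/{i}" for i in range(2000)] for d in range(1, 6)}
splits = {name: split_level(items) for name, items in levels.items()}
totals = {part: sum(len(s[part]) for s in splits.values()) for part in SPLIT_SIZES}
```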
### D\.2 Training Configuration
We train the PoLar prediction network using supervised learning on the training splits described above\. All hyperparameters are selected via validation tuning\.
#### Optimization\.
We use the AdamW optimizer and tune the learning rate from \{1e\-4, 3e\-4, 5e\-4, 8e\-4, 1e\-3, 3e\-3\}, the batch size from \{32, 128, 256\}, and the number of training epochs from \{3, 10\} based on validation performance\. We adopt a cosine learning rate schedule with a linear warmup of `warmup_steps` = 10 steps\. Unless otherwise specified, the best\-performing configuration on the validation set is used for reporting results\.
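A cosine schedule with linear warmup can be written as a plain function of the step index, as sketched below. The exact shape is an assumption: the text only names the schedule and `warmup_steps` = 10, so the warmup starting point, the decay to zero, and the names `lr_at` and `total_steps` are illustrative choices.

```python
import math

def lr_at(step: int, total_steps: int, base_lr: float, warmup_steps: int = 10) -> float:
    """Learning rate at a 0-indexed optimizer step."""
    if step < warmup_steps:
        # Linear warmup: reaches base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps (assumed floor of 0).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Example with one learning rate from the tuning grid and a hypothetical 110-step run.
schedule = [lr_at(s, total_steps=110, base_lr=3e-4) for s in range(111)]
```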
#### Output length\.
The maximum number of generated output tokens is set to 50, consistent with the output length used when collecting supervision via MCTS, as the model is instructed to directly generate the final answer rather than intermediate reasoning steps\.
### D\.3 Predictor Size
The learned PoLar predictor is lightweight compared with the frozen base LLM\. Across the evaluated models, the predictor contains approximately 2\.1M parameters, corresponding to only 0\.01%–0\.06% of the base model size\. This small footprint makes training and inference inexpensive relative to standard LLM fine\-tuning or full\-model execution\.
Table 11: Parameter size of the learned PoLar predictor compared with each frozen base LLM\.

| Model | Base LLM params | Predictor params | Predictor / Base |
| --- | --- | --- | --- |
| Qwen1.5-MoE-A2.7B-Chat | 14.32B | 2.11M | 0.0148% |
| Qwen2.5-3B-Instruct | 3.40B | 2.12M | 0.0623% |
| Qwen3-8B | 8.19B | 2.12M | 0.0258% |
| LLaMA-3.2-3B-Instruct | 3.61B | 2.11M | 0.0586% |
### D\.4 Direct Prompting
We adopt a direct prompting strategy throughout all experiments, without eliciting chain\-of\-thought or intermediate reasoning\. The model is explicitly instructed to output only the final answer in a strictly formatted form\.
Given a math problem instance `question`, the input prompt is constructed as follows:
```
Solve the following math problem and output ONLY the final answer directly,
formatted strictly as \boxed{ANSWER}.
### Problem Start
{question}
### Problem End
Answer:
```
This prompt design enforces concise answer generation and isolates the effect of latent execution programs from token\-level reasoning strategies\.
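A minimal sketch of filling this template in code is shown below. The template string mirrors the prompt above; `build_prompt` and `extract_boxed` are hypothetical helper names, and the answer-extraction regex is an assumption about post-processing that the paper does not describe.

```python
import re

PROMPT_TEMPLATE = (
    "Solve the following math problem and output ONLY the final answer directly,\n"
    "formatted strictly as \\boxed{{ANSWER}}.\n"
    "### Problem Start\n"
    "{question}\n"
    "### Problem End\n"
    "Answer:"
)

def build_prompt(question: str) -> str:
    """Insert one problem statement into the direct-prompting template."""
    return PROMPT_TEMPLATE.format(question=question)

def extract_boxed(text: str):
    """Return the content of the first \\boxed{...} in the output, or None (hypothetical helper)."""
    match = re.search(r"\\boxed\{([^{}]*)\}", text)
    return match.group(1).strip() if match else None

prompt = build_prompt("What is 2+3?")
answer = extract_boxed("\\boxed{5}")
```

The simple regex does not handle nested braces such as fractions inside the box; a real evaluation script would need a more careful parser.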