
# Two-dimensional early exit optimisation of LLM inference
Source: [https://arxiv.org/html/2604.18592](https://arxiv.org/html/2604.18592)
Jan Hůla¹, David Adamczyk¹, Tomáš Filip¹, Martin Pavlíček¹, Petr Sosík¹,²
¹Institute for Research and Applications of Fuzzy Modelling, University of Ostrava, 70200 Ostrava, Czech Republic
²Institute of Computer Science, Faculty of Philosophy and Science, Silesian University in Opava, 74601 Opava, Czech Republic
{jan.hula, david.adamczyk, tomas.filip, martin.pavlicek, petr.sosik}@osu.cz

###### Abstract

We introduce a two-dimensional (2D) early exit strategy that coordinates layer-wise and sentence-wise exiting for classification tasks in large language models. By processing input incrementally sentence-by-sentence while progressively activating deeper layers, our method achieves multiplicative computational savings that exceed those from optimizing either dimension independently. Experimental evaluation across four state-of-the-art LLMs (Llama 3.1, Llama 3.2, Gemma, Qwen; 3B–8B parameters) on three sentiment classification datasets demonstrates additional speed-ups of 1.4–2.3× over optimal layer-wise early exit for simpler tasks with vanilla models, with graceful degradation on complex multi-class problems. Fine-tuning reduces but does not eliminate this advantage. The approach is model-agnostic, requires only lightweight classification adapters, and is orthogonal to complementary efficiency methods such as quantization and pruning. Our findings indicate that 2D early exit strategies excel when semantic information accumulates predictably across input structure, suggesting possible applicability to sequence-processing tasks beyond sentiment classification.

Keywords: Large language model, Inference optimisation, Early exit, Input trimming

## 1 Introduction

Large Language Models (LLMs) have revolutionised Natural Language Processing (NLP) with their remarkable capabilities across a range of tasks. However, their extraordinary performance comes at a significant computational cost in terms of processing power, memory, and energy. This applies especially to inference, which runs extensively once the model is deployed, whereas training, though very demanding, is usually a one-time task. For relatively simple but high-throughput tasks such as classification, deploying full-scale LLMs represents an immense computational overhead. To this end, early exit strategies have emerged as a promising paradigm. They enable a model to terminate its computation at an intermediate layer when it is sufficiently confident in its prediction, while striving to maintain high accuracy.

In this paper, we introduce a two-dimensional (2D) early exit optimisation strategy that combines layer-wise and sentence-wise early exit. The intuition behind it is that, in many classification tasks, the proper class can already be determined from an initial part of the text. To preview the anatomy of the 2D EE strategy, Figure [1](https://arxiv.org/html/2604.18592#S1.F1) demonstrates how the probability of correct classification progresses along sentences and layers during document processing. Taking a sentence as a natural semantic unit, we split an input sample into sentences and feed them to the LLM one by one. Simultaneously, with each new sentence we increment the number of layers processing the so-far read part of the input. This results in rectangular processing blocks growing in 2D (layers vs. sentences), illustrated in Fig. [2](https://arxiv.org/html/2604.18592#S3.F2). The classification halts once the accumulated confidence ([5](https://arxiv.org/html/2604.18592#S3.E5)) exceeds a pre-defined threshold. In this way, a significant speed-up can be achieved compared to standard layer-wise early exit methods. Our findings indicate possible applicability to a range of sequence-processing tasks in which each input sample can be meaningfully split into a sequence of semantic units that contribute to the overall meaning incrementally. The main contributions are:

- **Methodology:** We integrate layer-wise early stopping with sentence-wise input trimming into one synergistic process. By simultaneously coordinating exiting in both dimensions, we achieve multiplicative rather than additive computational savings without sacrificing accuracy.
- **Benchmarking:** We evaluate the 2D strategy against optimal layer-wise early exit and LayerSkip across four state-of-the-art LLMs (Llama-3.1-8B, Llama-3.2-3B, Gemma-3n-E4B, Qwen2.5-7B) on three sentiment classification datasets of varying difficulty. The results demonstrate 1.4–2.3× additional speed-up (relative to layer-based early exit) for simpler tasks in vanilla models, with graceful degradation on harder problems.
- **Analysis and insights:** We analyse how fine-tuning affects the 2D advantage, provide hyperparameter optimisation guidance, and identify overall conditions for 2D strategies to excel.

Gemma-3n-E4B vanilla (adapter-only training)

![Refer to caption](https://arxiv.org/html/2604.18592v1/x1.png)

Gemma-3n-E4B fine-tuned

![Refer to caption](https://arxiv.org/html/2604.18592v1/x2.png)

Figure 1: Layer-wise and sentence-wise accuracy of correct classification of Gemma-3n-E4B vanilla (top) and fine-tuned (bottom) on the MMS dataset. Heatmaps based on 7100 samples of 10 sentences each.

The rest of the paper is structured as follows. Section [2](https://arxiv.org/html/2604.18592#S2) reviews related work. Section [3](https://arxiv.org/html/2604.18592#S3) outlines the methodology used, and Section [4](https://arxiv.org/html/2604.18592#S4) presents and analyses the experimental results. Section [6](https://arxiv.org/html/2604.18592#S6) discusses the implications of the work, and Section [7](https://arxiv.org/html/2604.18592#S7) concludes the paper with directions for future extensions.

## 2 Background

Traditional methods to improve inference efficiency in LLMs include pruning [[18](https://arxiv.org/html/2604.18592#bib.bib25)], weight quantisation [[3](https://arxiv.org/html/2604.18592#bib.bib24)], and knowledge distillation [[10](https://arxiv.org/html/2604.18592#bib.bib27)]. These techniques compress the model statically; that is, the resulting model operates with a fixed structure regardless of input. Although effective, they often cause accuracy degradation and may overfit to specific datasets or domains, reducing transferability between tasks. In contrast, dynamic inference methods adapt computation at runtime. A classical dynamic approach is to employ a cascade of models with increasing complexity, where a lightweight model handles easy instances, while only harder ones are passed to a larger model [[1](https://arxiv.org/html/2604.18592#bib.bib30)]. The Matryoshka paradigm [[12](https://arxiv.org/html/2604.18592#bib.bib36)] generalises this idea into a cascade of progressively complex nested models. MatFormer [[4](https://arxiv.org/html/2604.18592#bib.bib22)] integrates feedforward network blocks within transformer layers, allowing the extraction of multiple smaller models from one training process at no additional cost.

### 2.1 Early exit strategy

Early exit methods reduce inference cost by terminating the computation in intermediate layers once a predefined criterion (the exit trigger) is met. They are complementary to pruning, quantisation, or distillation and are applicable to virtually any transformer model without fundamental architectural change. The foundational paper [[24](https://arxiv.org/html/2604.18592#bib.bib44)] introduced the early exit paradigm for deep neural networks. For an overview of the topic, we refer, e.g., to the surveys [[13](https://arxiv.org/html/2604.18592#bib.bib38)] and [[19](https://arxiv.org/html/2604.18592#bib.bib35)].

Confidence-based exits. Early works attach auxiliary classifiers to intermediate layers and allow inference to stop once confidence exceeds a threshold. DeeBERT [[25](https://arxiv.org/html/2604.18592#bib.bib32)] pioneered entropy-based exits (off-ramps) for BERT, achieving up to 40% inference time reduction on GLUE benchmarks with a drop in accuracy below 4%. Schwartz et al. [[22](https://arxiv.org/html/2604.18592#bib.bib6)] proposed calibrated confidence thresholds, showing that multi-exit BERTs could achieve 3–5× faster inference with almost the same accuracy. To strengthen shallow exits, FastBERT [[17](https://arxiv.org/html/2604.18592#bib.bib8)] applied self-distillation and adaptive inference, producing reliable intermediate predictions and allowing tunable 1–12× speed-ups with graceful accuracy degradation. PABEE [[28](https://arxiv.org/html/2604.18592#bib.bib7)] introduces an exit strategy based on the prediction of internal classifiers remaining unchanged for a predefined number of consecutive steps.

Learned exit policies. Rather than relying on fixed heuristics, some approaches learn when to exit. Li et al. [[14](https://arxiv.org/html/2604.18592#bib.bib37)] introduced a predictive engine for early exiting decisions that allows the network to forecast when it should exit, adapting to computational and energy constraints in real time. ConsistentEE [[27](https://arxiv.org/html/2604.18592#bib.bib17)] employs reinforcement learning to align training-time supervision with inference-time exits, reporting double-digit percentage reductions in layer depth with the same or better accuracy. Schuster et al. [[21](https://arxiv.org/html/2604.18592#bib.bib12)] proposed Confident Adaptive Transformers (CAT), combining additional prediction heads with meta consistency classifiers to guarantee confidence while still obtaining 1.8–2.7× acceleration. Fan et al. [[7](https://arxiv.org/html/2604.18592#bib.bib40)] use statistical classifiers such as SVMs, trained on easily obtainable token-based features, to adaptively decide when to stop inference.

Architecture-integrated approaches. Some methods modify training or architecture to support the exit procedure. LayerSkip [[6](https://arxiv.org/html/2604.18592#bib.bib19)] introduces layer dropout with progressively higher dropout rates for deeper layers. An early exit loss supervises all intermediate layers using a weighted combination, and self-speculative decoding enables flexible skipping, resulting in 1.82–2.16× speed-up on coding, parsing, and summarization tasks with marginal accuracy loss. Liu et al. [[15](https://arxiv.org/html/2604.18592#bib.bib41)] use a fixed shallow sub-network as a self-draft model with a controlled exit, while the remaining layers serve as the larger target model. GREEN-CODE [[8](https://arxiv.org/html/2604.18592#bib.bib16)] uses reinforcement learning to optimise the exit policy in code generation, demonstrating 23–50% energy savings in Llama-3.2-3B and OPT-2.7B. Chen et al. [[2](https://arxiv.org/html/2604.18592#bib.bib39)] introduce a variety of optimisations targeting both the training and inference phases, enabling early-exit LLMs to scale up to 30B parameters.

Token-level and sequence-level extensions. Many works, including some already cited [[17](https://arxiv.org/html/2604.18592#bib.bib8), [6](https://arxiv.org/html/2604.18592#bib.bib19)], combine layer-wise early exit with token-wise exiting, where different tokens exit at different layers. At the sequence level, token-wise halting was pioneered by the Depth-Adaptive Transformer [[5](https://arxiv.org/html/2604.18592#bib.bib13)], which achieved faster decoding in machine translation. CALM (Confident Adaptive Language Modeling) [[20](https://arxiv.org/html/2604.18592#bib.bib14)] introduced dynamic allocation of different amounts of compute per input and achieved up to 3× speed-up on generative tasks. Khanna et al. [[11](https://arxiv.org/html/2604.18592#bib.bib3)] presented a model-agnostic inference framework that reduces computation via dynamic token halting, combined with KV-cache skipping and token-level redundancy reduction, achieving up to 40% resource reduction on frozen models. Recent work extends early exit to reasoning tasks: DEER [[26](https://arxiv.org/html/2604.18592#bib.bib1)] monitors model confidence at CoT transition points to terminate reasoning sequences early (19–80% reduction in CoT length), while NEAT [[16](https://arxiv.org/html/2604.18592#bib.bib43)] monitors neuron-level activation dynamics for fine-grained exit decisions (22–28% token reduction across CoT reasoning benchmarks).

Our 2D early exit strategy shares with CALM [[20](https://arxiv.org/html/2604.18592#bib.bib14)] and token-halting approaches [[5](https://arxiv.org/html/2604.18592#bib.bib13), [11](https://arxiv.org/html/2604.18592#bib.bib3)] the intuition that not every input unit warrants equal computational depth. However, these methods allocate variable compute per token within a fixed context, requiring careful handling of absent hidden representations and KV-cache management. By contrast, operating at the sentence level, a natural semantic boundary, ensures well-defined embeddings at every active layer, avoiding these complications and leveraging the predictable accumulation of sentiment signal across sentences. Furthermore, by coordinating sentence-wise input trimming with layer-wise early stopping within a unified framework, we achieve multiplicative rather than additive resource reductions in classification tasks.

## 3 Methods

The core idea of 2D early exit in transformers is to combine layer-wise and token-wise exiting into one framework that acts simultaneously in both dimensions to obtain a synergistic effect. We have chosen the sentence as the elementary semantic unit of the input. Each sample in the dataset is split into sentences by the algorithm detailed in Appendix [A](https://arxiv.org/html/2604.18592#A1). Throughout the paper, a sentence embedding is defined as the mean of the embeddings of all tokens in the sentence. Each tested model is supplemented with lightweight classification adapters (or classifiers, for short), one per layer, fed with the output embedding of that layer. The classifier architecture is a two-layer fully connected network:

- Linear(embedding_dim → 256) + ReLU
- Linear(256 → num_classes) + Softmax
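
A minimal PyTorch sketch of such an adapter follows; the module is our illustrative reconstruction of the architecture described above (the mean-pooled sentence embedding is per the definition in this section), not the authors' released code:

```python
import torch
import torch.nn as nn

class ClassificationAdapter(nn.Module):
    """Two-layer classification head attached to one transformer layer."""

    def __init__(self, embedding_dim: int, num_classes: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
            nn.Softmax(dim=-1),
        )

    def forward(self, sentence_embedding: torch.Tensor) -> torch.Tensor:
        # sentence_embedding: mean of the token embeddings of a sentence at
        # this layer's output, shape (batch, embedding_dim)
        return self.net(sentence_embedding)
```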

### 3.1 Training

The loss function used for training is based on those used in [[25](https://arxiv.org/html/2604.18592#bib.bib32), [28](https://arxiv.org/html/2604.18592#bib.bib7), [6](https://arxiv.org/html/2604.18592#bib.bib19)], which aim at accurate predictions at intermediate layers. Simultaneously, the loss function was expanded to promote correct classification based on a partial input. Consider a model with $L$ transformer layers. For a sample $(x,y)$ consisting of $m$ sentences, denote by $e_{i,k}(x)$ the embedding of sentence $k$, $0 \le k \le m-1$, of sample $x$ in the output of layer $i$, $0 \le i \le L-1$.

Let us define the prefix embedding $pe_{i,j}(x)$ as the mean embedding of the first $j$ sentences of the sample,

$$pe_{i,j}(x) = \frac{1}{j}\sum_{k=0}^{j-1} e_{i,k}(x), \qquad 0 \le j \le m-1. \tag{1}$$

Two variants of training have been tested: (i) adapter-only training and (ii) fine-tuning of the whole model.
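
In code, all prefix embeddings for one layer can be obtained as a running mean over the sentence embeddings (a sketch; `sentence_embs` is an assumed name for the per-sentence embeddings at one layer):

```python
import torch

def prefix_embeddings(sentence_embs: torch.Tensor) -> torch.Tensor:
    """Running means over sentences, per eq. (1).

    sentence_embs: (m, embedding_dim) tensor holding e_{i,0}, ..., e_{i,m-1}
    at one layer i. Row j-1 of the result is pe_{i,j}, the mean of the
    first j sentence embeddings.
    """
    m = sentence_embs.shape[0]
    cumsum = torch.cumsum(sentence_embs, dim=0)
    counts = torch.arange(1, m + 1, dtype=sentence_embs.dtype,
                          device=sentence_embs.device).unsqueeze(1)
    return cumsum / counts
```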

#### Adapter-only training

In this variant, only the classifiers for individual layers are trained, while the model weights remain frozen. For a training sample $(x,y)$ of $m$ sentences, each classifier is optimised to predict the target label $y$ solely from the output embedding of the corresponding layer. The sentence embeddings are aggregated using the mean operation and finally processed by the classification adapter, resulting in the loss function

$$\mathcal{L}_i(x,y) = \mathrm{CE}\Big(y,\, \mathrm{fc}_i\Big(\frac{1}{m}\sum_{j=0}^{m-1} e_{i,j}(x)\Big)\Big), \tag{2}$$

where $\mathrm{fc}_i(\cdot)$ is the output of the $i$-th classification adapter (a class probability distribution) and $\mathrm{CE}(\cdot,\cdot)$ is the cross-entropy function.
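
A sketch of this per-layer objective; because the adapter above ends in a softmax, cross-entropy is applied through the log of its output (the variable names are ours):

```python
import torch
import torch.nn.functional as F

def adapter_only_loss(adapter: torch.nn.Module,
                      sentence_embs: torch.Tensor,
                      y: torch.Tensor) -> torch.Tensor:
    """CE(y, fc_i(mean of all sentence embeddings)) for one layer, per eq. (2).

    sentence_embs: (m, embedding_dim) embeddings of one sample at layer i.
    y: target label, shape (1,).
    """
    mean_emb = sentence_embs.mean(dim=0, keepdim=True)  # (1, embedding_dim)
    probs = adapter(mean_emb)                           # softmax output
    return F.nll_loss(torch.log(probs + 1e-12), y)      # cross-entropy on probabilities
```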

#### Fine-tuning

The fine-tuning of all tested models, including the classifiers, is implemented with a PEFT adapter. The aggregate loss function

$$\mathcal{L}(x,y) = \sum_{i=0}^{L-1} \lambda_i\, \mathrm{CE}\Big(y,\, \mathrm{fc}_i\Big(\frac{1}{m}\sum_{j=0}^{m-1} pe_{i,j}(x)\Big)\Big) \tag{3}$$

for a sample $(x,y)$ is a weighted average of layer losses. Each layer loss is based on the output of the layer classifier, whose input is the average of all prefix embeddings ([1](https://arxiv.org/html/2604.18592#S3.E1)) for this layer and sample. The weights $\lambda_i$ were experimentally set to 0.8 for the last layer $L-1$ and 0.1 for all previous layers. These values provided the best accuracy across the three tested datasets; however, they may be further auto-tuned for different datasets.
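
Combining the pieces above, the aggregate objective might be sketched as follows (names illustrative; `prefix_embeddings` is the helper for eq. (1) shown earlier):

```python
import torch
import torch.nn.functional as F

def finetune_loss(adapters, layer_sentence_embs, y):
    """Weighted sum of per-layer losses, per eq. (3).

    adapters: list of L classification adapters, one per layer.
    layer_sentence_embs: list of L tensors of shape (m, embedding_dim),
        the sentence embeddings e_{i,j} at each layer's output.
    y: target label, shape (1,).
    """
    L = len(adapters)
    lambdas = [0.1] * (L - 1) + [0.8]            # 0.1 for all layers, 0.8 for the last
    total = torch.zeros(())
    for lam, adapter, embs in zip(lambdas, adapters, layer_sentence_embs):
        pe = prefix_embeddings(embs)             # (m, embedding_dim), see eq. (1)
        mean_pe = pe.mean(dim=0, keepdim=True)   # average of all prefix embeddings
        probs = adapter(mean_pe)
        total = total + lam * F.nll_loss(torch.log(probs + 1e-12), y)
    return total
```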

### 3.2 Inference

[Figure omitted: an 8-layer × 4-sentence grid whose numbered cells indicate the order in which blocks are processed.]

Figure 2: Visualization of the 2D early exit strategy for step size $\Delta = 2$; see eq. ([4](https://arxiv.org/html/2604.18592#S3.E4)). Red frames enclose inference progression blocks corresponding to individual sentences.

The key innovation of the 2D early exit strategy, detailed in Algorithm [1](https://arxiv.org/html/2604.18592#algorithm1), is the two-dimensional progression: the algorithm processes each input sample sentence-by-sentence, with each new sentence progressively increasing the number of network layers available for inference. For an input sample with $m$ sentences, when processing the sentence indexed by $s$, $0 \le s \le m-1$, the algorithm activates the first $L_s$ layers of the model,

$$L_s = \min\big((s+1) \times \Delta,\, L\big), \quad \text{where } \Delta = \max(1, \lfloor L/m \rfloor). \tag{4}$$
All previous sentences $s_1 < s$ pass through the newly activated layers they have not yet traversed, i.e., from $L_{s-1}$ to $L_s - 1$, while sentence $s$ passes through all layers from 0 to $L_s - 1$, resulting in the red-framed blocks illustrated in Fig. [2](https://arxiv.org/html/2604.18592#S3.F2). Hence, the layer depth is coupled to the sentence index, so each new sentence simultaneously expands both contextual breadth and representational depth.
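
The schedule of eq. (4) and the resulting block boundaries are compact to express (a sketch with names of our choosing):

```python
def layers_to_run(s: int, s1: int, num_layers: int, num_sentences: int) -> range:
    """Layers traversed by sentence s1 when sentence s arrives, per eq. (4).

    The current sentence (s1 == s) runs from layer 0 up to L_s - 1;
    earlier sentences only pass through the newly activated chunk.
    """
    delta = max(1, num_layers // num_sentences)   # step size in layers
    stop = min((s + 1) * delta, num_layers)       # L_s
    start = 0 if s1 == s else delta * s           # L_{s-1} for earlier sentences
    return range(start, stop)
```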

Early stopping is controlled by two hyperparameters: the ignore threshold $\tau_{ignore}$ and the acceptance threshold $\tau_{acc}$. Denote by $\mathrm{fc}_{\ell,c}(s_1)$ the softmax output (probability) of the classification adapter in layer $\ell$ for the input $s_1$, corresponding to class $c$. For each $\ell$ and $s_1$, the algorithm computes the confidence as the difference between the maximum and second-maximum softmax outputs:

$$\text{confidence} = \max_c \{\mathrm{fc}_{\ell,c}(s_1)\} - \mathrm{second\_max}_c \{\mathrm{fc}_{\ell,c}(s_1)\}. \tag{5}$$

If the confidence exceeds $\tau_{ignore}$, it is added to the accumulated confidence score of the predicted class. When the accumulated confidence of any class exceeds $\tau_{acc}$, the inference terminates immediately, returning the prediction with the highest score. If no class reaches the $\tau_{acc}$ threshold after processing all layers and sentences, the algorithm returns the prediction of the final layer for the last sentence.

The 2D early exit inference pseudocode:

Input: example – input text; $\tau_{ignore}$, $\tau_{acc}$ – hyperparameters (ignore and acceptance thresholds).
Output: predicted_label – prediction for the input example; operations_used – consumed abstract operations, to measure the speed-up.

1. Initialization:
   - sentences ← list of sentences in example
   - num_sentences ← number of sentences in example
   - num_layers ← number of layers of the model
   - operations_used ← 0 (counter of operations used)
   - Δ ← max(1, num_layers // num_sentences) (step size in layers)
   - acc ← [0, 0, 0, …] (confidence accumulators for each class)
2. Process sentences and layers until early stopping:
   - for s = 0 to num_sentences − 1 do:
     - for s1 = 0 to s do (process all sentences up to s):
       - layers_to_traverse ← min((s + 1) × Δ, num_layers) (number of active layers for sentences up to s)
       - if s1 = s then start_layer ← 0, else start_layer ← Δ × s (sentences before s pass only through the newly added chunk of layers)
       - for l = start_layer to layers_to_traverse − 1 do:
         - operations_used ← operations_used + 1
         - embeddings[l][s1] ← calculate_embedding(l, sentences[s1]) (reusing the stored embeddings of previous layers and sentences)
         - [p₀, p₁, p₂, …] ← classifier_l(embeddings[l][s1]) (the classifier at layer l outputs probabilities for all classes)
         - predicted_label ← argmax([p₀, p₁, p₂, …])
         - sorted_probs ← SORT([p₀, p₁, p₂, …]) in descending order
         - confidence ← sorted_probs[0] − sorted_probs[1] (confidence of the winning class: max − second max)
         - if confidence > $\tau_{ignore}$ then acc[predicted_label] ← acc[predicted_label] + confidence (accumulate confidence, filtered by the ignore threshold)
         - if acc[predicted_label] > $\tau_{acc}$ then return (predicted_label, operations_used) (early stopping)
3. All layers and sentences processed – no early stopping:
   - final_label ← prediction from the last sentence at the last layer
   - return (final_label, num_layers × num_sentences)

Algorithm 1: The 2D early exit inference for one input example.
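
For concreteness, a minimal Python sketch of Algorithm 1 is given below. It assumes the per-layer sentence embeddings and trained classifiers are available as callables; `calculate_embedding` and `classifiers` are hypothetical names standing in for the model-specific implementations:

```python
import numpy as np

def two_d_early_exit(sentences, classifiers, calculate_embedding,
                     num_layers, tau_ignore, tau_acc, num_classes):
    """2D early exit inference for one example (sketch of Algorithm 1).

    classifiers[l] maps an embedding to a probability vector;
    calculate_embedding(l, s1, embeddings) returns the layer-l embedding of
    sentence s1, reusing embeddings stored for previous layers and sentences.
    """
    m = len(sentences)
    delta = max(1, num_layers // m)              # step size in layers
    acc = np.zeros(num_classes)                  # per-class confidence accumulators
    operations_used = 0
    embeddings = {}                              # (layer, sentence) -> embedding
    predicted_label = None

    for s in range(m):
        layers_to_traverse = min((s + 1) * delta, num_layers)
        for s1 in range(s + 1):
            # the current sentence passes all active layers;
            # earlier sentences only the newly activated chunk
            start_layer = 0 if s1 == s else delta * s
            for l in range(start_layer, layers_to_traverse):
                operations_used += 1
                embeddings[(l, s1)] = calculate_embedding(l, s1, embeddings)
                probs = classifiers[l](embeddings[(l, s1)])
                predicted_label = int(np.argmax(probs))
                top2 = np.sort(probs)[-2:]
                confidence = top2[1] - top2[0]   # max minus second max
                if confidence > tau_ignore:
                    acc[predicted_label] += confidence
                if acc[predicted_label] > tau_acc:
                    return predicted_label, operations_used   # early exit

    # no early stopping: fall back to the last prediction and full cost
    return predicted_label, num_layers * m
```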
### 3.3 Metrics and evaluation

The experiments were evaluated using two metrics: (i) classification accuracy and (ii) the speed-up enabled by early exit, expressed as the ratio of abstract operations in a full and an early-terminated inference. For each model, dataset, and training mode, let $acc(\ell)$ be the accuracy of the classifier at layer $\ell$ on the test split. We set the acceptable accuracy threshold to

$$\textit{acc\_thr} = \max_{0 \le \ell < L} \{acc(\ell)\} - T, \tag{6}$$

where $T$ is the allowed accuracy loss. We repeated the experiments for $T = 0.02$ (2%) and $T = 0.04$ (4%). Then, for all EE methods tested, we ran the inference until the accuracy threshold was met and calculated the corresponding speed-up. For layer-wise early exit, the speed-up was set as

$$\textit{speed-up} = L / (L_e + 1), \tag{7}$$

where $L$ is the total number of transformer layers in the model and $L_e$ is the exit layer at which the accuracy threshold is met (layers numbered from 0 to $L-1$).
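
As a worked sketch of eqs. (6)–(7), taking the first layer that meets the threshold as the exit layer (an assumption consistent with the stars in Fig. 3):

```python
def optimal_layerwise_speedup(layer_accs: list[float], T: float = 0.02) -> float:
    """Speed-up of the optimal layer-wise early exit, per eqs. (6)-(7).

    layer_accs: per-layer classifier accuracies acc(l) for l = 0 .. L-1.
    T: allowed accuracy loss (0.02 or 0.04 in the experiments).
    """
    L = len(layer_accs)
    acc_thr = max(layer_accs) - T                                    # eq. (6)
    L_e = next(l for l, a in enumerate(layer_accs) if a >= acc_thr)  # exit layer
    return L / (L_e + 1)                                             # eq. (7)
```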

The speed-up achieved by the 2D EE strategy was quantified as follows. We count the processing of one sentence in one transformer layer as one abstract operation. When the accuracy threshold is not met, Algorithm [1](https://arxiv.org/html/2604.18592#algorithm1) processes the complete input through all layers using $m \times L$ operations, which corresponds to an inference with no early exit. When early stopping occurs, the algorithm returns the number of operations used, and the speed-up is calculated as

$$\textit{speed-up} = (m \times L) / \textit{operations\_used}. \tag{8}$$

One can argue that, since each sentence attends to all previous sentences, its processing time is proportional to its position in the input text. However, this effect matters only for long inputs of at least thousands of sentences. Denote:

- tps – average number of tokens per sentence (≈ 15 for user reviews);
- embed_dim – embedding dimension of the model;
- exp_f – expansion factor of the MLP module in a transformer layer.

Considering, e.g., Llama-3.2-3B with embed_dim = 3072 and exp_f ≈ 2.67, the computationally intensive operations in a transformer layer processing the $s$-th sentence of an input example are:

- Query, key, value computation: ≈ 3 × tps × embed_dim² ≈ 4.2 × 10⁸
- Attention matrix computation: ≈ s × tps² × embed_dim ≈ 6.9 × 10⁵ × s
- MLP expansion and compression: ≈ 2 × tps × embed_dim² × exp_f ≈ 7.6 × 10⁸

Therefore, the linear term containing $s$ becomes relevant only for $s = \Omega(10^3)$.
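
A quick check of these estimates (a rough sketch using the constants above; attention-head bookkeeping and normalisation costs are ignored):

```python
def per_layer_ops(s: int, tps: int = 15, embed_dim: int = 3072,
                  exp_f: float = 2.67) -> tuple[float, float, float]:
    """Rough operation counts in one transformer layer for the s-th sentence."""
    qkv = 3 * tps * embed_dim ** 2            # ~4.2e8, independent of s
    attn = s * tps ** 2 * embed_dim           # ~6.9e5 * s, linear in s
    mlp = 2 * tps * embed_dim ** 2 * exp_f    # ~7.6e8, independent of s
    return qkv, attn, mlp

qkv, attn, mlp = per_layer_ops(s=1000)
print(attn / (qkv + mlp))  # ~0.59: even at s = 1000 the linear term is not dominant
```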

## 4 Experiments

The aim of the experiments was to compare the efficiency of the 2D EE strategy with recent approaches based on exiting at intermediate layers. Numerous studies, such as [[25](https://arxiv.org/html/2604.18592#bib.bib32), [29](https://arxiv.org/html/2604.18592#bib.bib10), [21](https://arxiv.org/html/2604.18592#bib.bib12), [14](https://arxiv.org/html/2604.18592#bib.bib37), [6](https://arxiv.org/html/2604.18592#bib.bib19), [7](https://arxiv.org/html/2604.18592#bib.bib40), [15](https://arxiv.org/html/2604.18592#bib.bib41)] and many others, have examined methods of triggering layer-wise early exit. They often differ in language models and datasets, and some use additional mechanisms to improve early exit efficiency, which makes comparison difficult or impossible. Therefore, we compare our 2D EE strategy against the optimal EE trigger, with the layer-wise speed-up calculated by formulas ([6](https://arxiv.org/html/2604.18592#S3.E6)) and ([7](https://arxiv.org/html/2604.18592#S3.E7)). This baseline provides the supremum speed-up over all possible layer-wise exit triggers, implying that the additional speed-up achieved by the 2D EE strategy is guaranteed with respect to a range of previous layer-wise EE studies.

The experiments were conducted in the two training modes described in Section [3.1](https://arxiv.org/html/2604.18592#S3.SS1). For each mode, we compared the results of the optimal layer-wise early exit described above with those of the 2D early exit. Finally, we also conducted experiments comparing the 2D EE results with the LayerSkip method [[6](https://arxiv.org/html/2604.18592#bib.bib19)] for the Llama-3.1-8B and Llama-3.2-3B models (the LayerSkip library is not available for Gemma or Qwen).

### 4.1 Datasets

We used three publicly available datasets for sentiment analysis\.

1. Steam (2 sentiment classes)
2. MMS (3 sentiment classes)
3. Amazon-5 (5 sentiment classes)

From each dataset, we randomly chose a sentiment-balanced English subset of 15K samples (10K train, 5K test), restricted to samples with at least 10 sentences.

### 4.2 Language models

ModernBERT is an encoder-only model released in 2024, which we use as a baseline for our experiments. The Llama, Gemma, and Qwen families are popular and widely used medium-sized decoder-only LLMs. We have chosen four of these models, ranging from 3B to 8B parameters.

### 4.3 Implementation

The experiments were conducted on a single-node DGX Station A100 with 1× AMD EPYC 7742 (64 cores / 128 threads), 4× Nvidia A100 SXM4 GPUs with 40 GB of HBM2 each (NVLink-connected), and approximately 512 GB of system RAM. We used the same training setup for all models. Training one adapter took 2.7 min and, as the adapters are independent, all can be trained in parallel. We used the following settings:

- Optimiser: Adam with a learning rate of 1 × 10⁻⁵
- Batch size: 128
- Epochs: 20

For fine-tuning, including the ModernBERT and LayerSkip implementations, the LoRA configuration used a rank of 64 and a scaling factor (alpha) of 16, resulting in a training time of 66.9 min for Llama-3.1-8B:

- Optimiser: AdamW with a learning rate of 1 × 10⁻⁴
- Batch size: 16
- Epochs: 2
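
For reference, the stated configuration corresponds roughly to the following PEFT setup (a sketch; the target modules are our assumption and are not stated in the paper):

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=64,                                  # LoRA rank, as stated above
    lora_alpha=16,                         # scaling factor (alpha)
    target_modules=["q_proj", "v_proj"],   # assumption: typical attention projections
)
# base_model: the pretrained LLM, loaded e.g. via transformers
# model = get_peft_model(base_model, lora_config)
```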

For the layer-wise early exit, the model was fed the whole input example, and the classifiers at each layer calculated their predictions from the embedding of the last sentence of the example (which aggregates information from the entire example thanks to the attention mechanism).

The 2D early exit inference was implemented by Algorithm [1](https://arxiv.org/html/2604.18592#algorithm1), controlled by the two hyperparameters $\tau_{ignore}$ and $\tau_{acc}$. The algorithm is designed for downstream-task inference, where the hyperparameters are already pre-set. A systematic exploration of the hyperparameter space (≈ 250 settings) was needed to explain the behaviour of the algorithm for all 24 combinations of dataset/model/training mode; see Fig. [6](https://arxiv.org/html/2604.18592#A2.F6) for an example. To save time and energy, we pre-computed and stored the embeddings of each sentence of all test samples at all layers of each model, along with the corresponding metadata. The embeddings were used to compute the probability vectors $[p_0, p_1, p_2, \ldots]$ in the output of the trained classifiers, which were then processed by Algorithm [1](https://arxiv.org/html/2604.18592#algorithm1). This sped up the experiments by roughly two orders of magnitude.

## 5 Results

We compared the efficiency of the 2D early exit strategy with the other EE methods tested, based on the metrics detailed in Section [3.3](https://arxiv.org/html/2604.18592#S3.SS3). Generally, the 2D early exit strategy achieved a substantial speed-up increase across different model architectures for simpler tasks, while its results were inconclusive for the most complex Amazon-5 dataset. The 2D EE advantage depends on the task complexity, the training protocol (adapter-only training versus model fine-tuning), and the allowed accuracy loss, as detailed in Sec. [5.5](https://arxiv.org/html/2604.18592#S5.SS5).

### 5.1 Baseline results

![Refer to caption](https://arxiv.org/html/2604.18592v1/layer_accuracies_Llama_3_1_mms.png)

Figure 3: Layer-wise accuracy of Llama-3.1-8B, vanilla model (above) and fine-tuned model (below), averaged over the MMS test dataset. Full circles mark the layers with the best accuracy. Stars mark the first layers with accuracy 2% below the maximum.

To put our results into context with the numerous studies using BERT-based models, we compared the accuracy of ModernBERT Large with the mid-size LLMs used in our 2D EE study. The results of the fine-tuned models on the three sentiment classification datasets are listed in Table [1](https://arxiv.org/html/2604.18592#S5.T1). No early exit was used.

Table 1: Accuracy of the fine-tuned ModernBERT Large versus the other models tested.

As an early exit speed-up baseline, we use the optimal idealised layer-wise early exit. Fig. [3](https://arxiv.org/html/2604.18592#S5.F3) shows the typical layer-wise accuracy curves for vanilla and fine-tuned LLMs (no early exit). In vanilla models, the layer-wise accuracy usually reaches its maximum approximately halfway through the model depth and then slightly decreases. Fine-tuning increases the accuracy of all models as expected, but also flattens the layer-wise accuracy curves in the upper layers. To view this phenomenon from another perspective, Fig. [5](https://arxiv.org/html/2604.18592#A2.F5) shows the cosine distances between embeddings in consecutive layers of Llama-3.1-8B.

### 5.2 Results of adapter-only training

The results are summarised in Table [2](https://arxiv.org/html/2604.18592#S5.T2). In some cases, marked with *, the 2D EE strategy did not reach the accuracy threshold ([6](https://arxiv.org/html/2604.18592#S3.E6)) for a loss of 2%. For simpler classification tasks (Steam with 2 classes and MMS with 3 classes), the 2D approach consistently outperformed the layer-wise early exit by factors of 1.4–2.3× (Steam) and 1.1–2.7× (MMS). For the most challenging dataset (Amazon-5 with 5 classes), 2D EE dominated only for the Gemma and Qwen models, providing a 1.1–1.5× speed-up improvement, while it was unfavourable for the Llama models.

Table 2: Results of the 2D early exit strategy in vanilla models. The cases in which 2D EE did not reach the 2% accuracy loss threshold are marked by *.
### 5.3 Results of fine-tuned models

Fine-tuning (Table [3](https://arxiv.org/html/2604.18592#S5.T3)) significantly reduced the advantage of the 2D EE strategy, which remained clearly superior for the simplest Steam dataset by a factor of 1.4–1.7×. On MMS with 3 classes, the advantage was still 1.05–1.25× for a 4% accuracy loss, but almost vanished for a 2% loss. Finally, the layer-wise early exit outperformed the 2D EE method on the Amazon-5 dataset for both 2% and 4% accuracy loss. This suggests that, while adapter-only training preserves the layer-wise accuracy variations beneficial for 2D early exit, model fine-tuning flattens these variations (Fig. [1](https://arxiv.org/html/2604.18592#S1.F1)), reducing the 2D advantage despite improving overall accuracy. Further analysis is provided in the ablation study, Sect. [6.1](https://arxiv.org/html/2604.18592#S6.SS1).

Table 3: Results of the 2D early exit strategy in fine-tuned models.
### 5.4 LayerSkip experiments

The experiments were carried out for the Llama-3.1-8B and Llama-3.2-3B models, which are supported by the available LayerSkip implementation. Unlike the original paper [[6](https://arxiv.org/html/2604.18592#bib.bib19)], where the exit layer is manually set as one of the hyperparameters, we let LayerSkip pass through all layers and stored the accuracies per layer. Its best accuracy among all layers was in line with our layer-wise EE results for all three datasets, confirming the robustness of the methodology used. For comparison with the 2D EE approach (Tab. [4](https://arxiv.org/html/2604.18592#S5.T4)), we took the first layer at which LayerSkip matched the accuracy threshold ([6](https://arxiv.org/html/2604.18592#S3.E6)) and calculated the corresponding speed-up ([7](https://arxiv.org/html/2604.18592#S3.E7)). The speed-up of the 2D EE strategy was clearly superior for the Steam and MMS datasets. LayerSkip dominated for the Amazon-5 dataset with an accuracy loss of 2%, but 2D EE matched or outperformed LayerSkip for a 4% loss.

| Dataset | Model | Acc | LayerSkip layer (2% loss) | LayerSkip spd-up (2% loss) | 2D EE spd-up (2% loss) | LayerSkip layer (4% loss) | LayerSkip spd-up (4% loss) | 2D EE spd-up (4% loss) |
|---|---|---|---|---|---|---|---|---|
| Steam | Llama-3.1-8B | 0.97 | 11 | 2.9 | 4.3 | 10 | 3.2 | 6.9 |
| Steam | Llama-3.2-3B | 0.96 | 11 | 2.5 | 4.1 | 10 | 2.8 | 5.8 |
| MMS | Llama-3.1-8B | 0.80 | 15 | 2.1 | 2.1 | 15 | 2.1 | 3.2 |
| MMS | Llama-3.2-3B | 0.79 | 15 | 1.9 | 2.1 | 15 | 1.9 | 2.9 |
| Amazon-5 | Llama-3.1-8B | 0.65 | 16 | 2.0 | 1.6 | 15 | 2.1 | 2.1 |
| Amazon-5 | Llama-3.2-3B | 0.63 | 15 | 1.9 | 1.6 | 15 | 1.9 | 2.1 |

Table 4: Results of LayerSkip compared to the 2D EE technique.
### 5.5 Analysis of results

The advantage of the 2D EE strategy over layer-wise early exit (or the LayerSkip method) in our experiments depends on several factors.

#### Task complexity

The 2D EE dominated in simpler tasks (Steam and MMS, with classification into 2 or 3 classes, respectively), but layer-wise EE proved more efficient for the most challenging task (Amazon-5 with 5 classes), except for two test cases. This suggests that 2D strategies excel when semantic information accumulates predictably across sentences.

#### Accuracy loss

The chosen loss of 2% balanced performance preservation with speed-up across all tested configurations, while the loss of 4% favoured the 2D EE methods. We attribute this to the fact that 2D EE provides many more possible exit points, resulting in a graceful accuracy decay when cutting both the number of processing layers and the processed part of the input.

#### Fine-tuning

Fine-tuning consistently decreased the advantage of 2D EE over layer-wise EE across all three datasets and all models tested. We leave a deeper analysis of this phenomenon to the ablation study in Sect. [6.1](https://arxiv.org/html/2604.18592#S6.SS1).

## 6 Discussion

### 6.1 Ablation study

To understand the individual contributions of the different components of our 2D early exit framework, we conducted an ablation study on the Gemma-3n-E4B model using the MMS dataset. We compared four configurations: (i) vanilla model with adapter-only training and layer-wise early exit, (ii) the same model with 2D early exit, (iii) fine-tuned model with layer-wise early exit, and (iv) the same model with 2D early exit.

Table [5](https://arxiv.org/html/2604.18592#S6.T5) shows the results for an accuracy loss of 2%. The baseline configuration (i) achieves a 2.3× speed-up through layer-wise early exit alone. Adding sentence-wise trimming in configuration (ii) increases the speed-up to 4.6×, demonstrating that the 2D coordination provides a 2× multiplicative benefit beyond layer-wise exiting. However, model fine-tuning ((iii) and (iv)), while improving overall accuracy (0.78 vs. 0.74), provides a speed-up of 1.8 / 1.9 for layer-wise / 2D early exit, respectively, reducing the multiplicative benefit of 2D EE to 1.06×.

Table 5: Ablation study results on Gemma-3n-E4B with the MMS dataset (accuracy loss of 2%).

To analyse the role of fine-tuning, we calculated heatmaps of accuracy along the layer and sentence axes for Gemma-3n-E4B, both vanilla and fine-tuned (Fig. [1](https://arxiv.org/html/2604.18592#S1.F1)), on an MMS sub-dataset of samples with 10 sentences (7100 samples). Fig. [4](https://arxiv.org/html/2604.18592#S6.F4) compares the average block-wise accuracy curves based on these data for the 2D EE progression blocks defined in Sect. [3.2](https://arxiv.org/html/2604.18592#S3.SS2). Rather surprisingly, while fine-tuning increased accuracy in later layers and sentences, it simultaneously decreased it in earlier layers and sentences, despite the fact that the fine-tuning loss function ([3](https://arxiv.org/html/2604.18592#S3.E3)) progressively penalises loss in early layers and sentences.

![Refer to caption](https://arxiv.org/html/2604.18592v1/acc_block_sums_Gemma.png)

Figure 4: Average block accuracies of 2D early exit for Gemma-3n-E4B on the MMS sub-dataset of samples with 10 sentences.
### 6.2 Hyperparameter setting

The hyperparameters used during training include the weights $\lambda_i$ of the losses of individual layers in the aggregate loss function ([3](https://arxiv.org/html/2604.18592#S3.E3)). While [[5](https://arxiv.org/html/2604.18592#bib.bib13), [6](https://arxiv.org/html/2604.18592#bib.bib19)] use a quadratic increase in weights towards higher layers, we experimentally verified that the simpler setting described in Section [3.1](https://arxiv.org/html/2604.18592#S3.SS1) provided comparable results, indicating the robustness of the weights.

Inference requires tuning two hyperparameters: $\tau_{ignore}$ (confidence threshold for accumulation) and $\tau_{acc}$ (acceptance threshold for early stopping). Analysis of the hyperparameter space, illustrated in Figure [6](https://arxiv.org/html/2604.18592#A2.F6), reveals low sensitivity to hyperparameters: well-behaved optimisation landscapes with broad and clear global maxima, making efficient search strategies viable. A simple hill-climbing or coarse-to-fine grid search can identify near-optimal configurations within 20–30 evaluations. The optimal values tend to cluster predictably: $\tau_{ignore}$ typically ranges from 0.3 to 0.5 (filtering low-confidence predictions), while $\tau_{acc}$ scales with the complexity of the task.
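
A coarse-to-fine search over the two thresholds might look as follows (a sketch; `evaluate` is a stand-in for running Algorithm 1 over a validation split and returning accuracy and speed-up, and the search ranges are our assumptions):

```python
import numpy as np

def tune_thresholds(evaluate, acc_thr, n_rounds=2, grid=5):
    """Coarse-to-fine grid search for (tau_ignore, tau_acc): maximise the
    speed-up among settings whose accuracy stays above acc_thr (eq. (6))."""
    ign_lo, ign_hi = 0.0, 1.0
    acc_lo, acc_hi = 0.0, 5.0
    best = None  # (tau_ignore, tau_acc, speedup)
    for _ in range(n_rounds):
        for ti in np.linspace(ign_lo, ign_hi, grid):
            for ta in np.linspace(acc_lo, acc_hi, grid):
                accuracy, speedup = evaluate(ti, ta)
                if accuracy >= acc_thr and (best is None or speedup > best[2]):
                    best = (ti, ta, speedup)
        if best is None:
            break
        # zoom in around the current best setting
        ign_lo, ign_hi = max(0.0, best[0] - 0.15), min(1.0, best[0] + 0.15)
        acc_lo, acc_hi = max(0.0, best[1] - 0.75), best[1] + 0.75
    return best
```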

### 6.3 Comparison with other works

Our results align with and extend existing findings. DeeBERT achieved ≈ 40% inference reduction on the GLUE SST-2 task with < 4% accuracy loss, in particular a 44–47% reduction on SST-2 with ≈ 2% accuracy loss, comparable to our experiment with the Steam dataset (3.2–4.3× speed-up, i.e., 69–77% reduction, < 2% loss). The comparison with LayerSkip showed an advantage of 2D EE of ≈ 1.6× for the Steam dataset (2% accuracy loss) and ≈ 1.8× for MMS (4% loss). Unlike methods requiring architectural modifications (LayerSkip, MatFormer) or reinforcement learning (ConsistentEE, GREEN-CODE), our approach remains model-agnostic and training-efficient, requiring only lightweight classification adapters. The key distinction is our simultaneous optimisation across both the layer and token dimensions, which, as demonstrated, can provide multiplicative rather than additive benefits when task complexity permits early semantic convergence.

### 6.4 Limitations

#### Datasets

We used three public datasets of customer reviews with different difficulty (2 / 3 / 5 classes). Consequently, the fine-tuned models returned average test-set accuracies of 0.96 / 0.79 / 0.64, with only slight variations between models. This choice of datasets allowed us to demonstrate how 2D EE efficiency depends on the complexity of the task. We could not use some popular benchmarks such as GLUE, since longer inputs (≥ 10 sentences) are needed to demonstrate the 2D EE strength. Experiments with datasets beyond sentiment classification will be needed to assess general applicability.

#### Baselines

To avoid comparison with numerous previous works using various training and inference mechanisms, which necessarily leads to methodological inconsistency and incompleteness, we generalised layer-wise EE studies to an optimal early exit using an exit layer with the best possible speed-up (Section [4](https://arxiv.org/html/2604.18592#S4)). This allowed a principled comparison of the 1D (layer-wise) and 2D (layer-wise and token-wise) early exit philosophies. In addition, we included a direct comparison of the results with an influential recent study [[6](https://arxiv.org/html/2604.18592#bib.bib19)].

#### Metrics

Early exit trades a substantial speed-up (and energy reduction) for a small drop in output quality. The choice of accuracy metrics suffices for this purpose, as in other similar studies. The inference speed-up was expressed as the ratio of abstract operations (processed layers/input sentences) used in the full and the exiting model, which is machine-independent and model-agnostic. We do not include wall-clock timing results because a fair runtime evaluation would require an optimised custom CUDA kernel for the proposed early-exit computation. A naive implementation would not faithfully represent the method's true efficiency, as the measured latency would be dominated by implementation overhead rather than by the underlying reduction in computation. Developing such a kernel is outside the scope of this paper, so we restrict our evaluation to theoretical/computational analysis.

### 6.5 Orthogonality with other methods

The 2D EE strategy can be combined with a range of complementary methods to further enhance efficiency. One option is to integrate 2D EE with Mixture of Experts (MoE) architectures [[23](https://arxiv.org/html/2604.18592#bib.bib45)], which dynamically route input through a sparse set of specialised sub-networks. Matryoshka representation learning [[12](https://arxiv.org/html/2604.18592#bib.bib36)] organises representations into nested levels of complexity. Among the models we tested, Gemma-3n-E4B uses the Matryoshka architecture, and therefore our results confirm its orthogonality with the 2D EE technique. Similarly, quantization methods [[3](https://arxiv.org/html/2604.18592#bib.bib24)] can allow intermediate exits to operate at low precision when used alongside 2D EE. Pruning techniques [[18](https://arxiv.org/html/2604.18592#bib.bib25)] can be applied to remove redundant weights or neurons; combined with 2D EE, pruning can further lower computational requirements by simplifying both the early and late stages of the model. Finally, knowledge distillation [[10](https://arxiv.org/html/2604.18592#bib.bib27)] offers a compelling way to train intermediate exit layers. By distilling knowledge from the full model into these layers, one can boost the accuracy of early predictions at both the layer-wise and token-wise levels.

## 7 Conclusion and future work

We have introduced a novel two-dimensional early exit strategy that simultaneously coordinates layer-wise and sentence-wise exiting for classification tasks in large language models. By processing input incrementally sentence-by-sentence while progressively activating deeper layers, our method achieves computational savings that exceed those possible through layer-wise or token-wise optimisation alone.

Our experimental evaluation across four state-of-the-art LLMs (Gemma-3n-E4B, Llama-3.1-8B, Llama-3.2-3B, Qwen2.5-7B) on three sentiment classification datasets demonstrates substantial additional speed-ups over optimal layer-wise early exit with vanilla models: up to 2.3× for both binary classification and a three-class problem, and up to 1.5× for complex multi-class problems (Amazon-5 with 5 classes). The 2D advantage persists, though diminished, with fine-tuned models (up to 1.7× for two classes and up to 1.3× for three classes). However, the 2D advantage disappears on the Llama models for the Amazon-5 task, indicating that 2D strategies excel primarily when semantic information accumulates predictably across sentences.

An intriguing finding is that fine-tuning, while improving overall accuracy, paradoxically reduces the 2D early exit advantage by flattening accuracy variations across early layers and partial inputs. This suggests an unexplored tension between optimising for final accuracy and preserving the gradient structures that enable effective early exiting. The heatmaps in Fig. [1](https://arxiv.org/html/2604.18592#S1.F1) and the graph in Fig. [4](https://arxiv.org/html/2604.18592#S6.F4) illustrate this phenomenon, revealing that fine-tuning increases accuracy in later processing stages while simultaneously decreasing it in earlier ones, despite the loss function explicitly encouraging early predictions.

Future work should address several directions. First, developing fine-tuning strategies that explicitly preserve or enhance accuracy gradients across both dimensions could unlock the full potential of 2D early exit. Techniques for stabilising layer-wise semantics, such as GER-steer [[9](https://arxiv.org/html/2604.18592#bib.bib46)], might be helpful. Second, extending the approach beyond sentiment classification to tasks such as topic categorisation, intent detection, or question answering would test its generalisability. Third, integrating reinforcement learning to enable adaptive, per-sample layer-skipping patterns could provide even more flexible resource allocation. Finally, combining 2D early exit with orthogonal techniques such as quantisation, pruning, or mixture-of-experts architectures could yield multiplicative efficiency gains across multiple optimisation dimensions.

## Acknowledgements

The authors thank Lou Nováků for carrying out experiments with ModernBERT\.

## Declarations

### Funding

This article was produced with the financial support of the European Union under the REFRESH – Research Excellence For REgion Sustainability and High-tech Industries project, number CZ.10.03.01/00/22_003/0000048, via the Operational Programme Just Transition, and under the Biography of Fake News with a Touch of AI: Dangerous Phenomenon through the Prism of Modern Human Sciences project, no. CZ.02.01.01/00/23_025/0008724, via the Operational Programme Jan Ámos Komenský. It was also supported by the Silesian University in Opava under the Student Funding Plan, project SGS/9/2024.

### Conflict of interest

The authors declare that they have no conflict of interest\.

### Data availability

## References

- [1] L. Chen, M. Zaharia, and J. Zou (2023). FrugalGPT: how to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176.
- [2] Y. Chen, X. Pan, Y. Li, B. Ding, and J. Zhou (2023). EE-LLM: large-scale training and inference of early-exit large language models with 3D parallelism. arXiv preprint arXiv:2312.04916.
- [3] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer (2023). QLoRA: efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems 36, pp. 10088–10115.
- [4] F. Devvrit, S. Kudugunta, A. Kusupati, T. Dettmers, K. Chen, I. Dhillon, Y. Tsvetkov, H. Hajishirzi, S. Kakade, A. Farhadi, et al. (2024). MatFormer: nested transformer for elastic inference. Advances in Neural Information Processing Systems 37, pp. 140535–140564.
- [5] M. Elbayad, J. Gu, E. Grave, and M. Auli (2020). Depth-adaptive transformer. In ICLR 2020: Eighth International Conference on Learning Representations, pp. 1–14.
- [6] M. Elhoushi, A. Shrivastava, D. Liskovich, B. Hosmer, B. Wasti, L. Lai, A. Mahmoud, B. Acun, S. Agarwal, A. Roman, A. Aly, B. Chen, and C. Wu (2024). LayerSkip: enabling early exit inference and self-speculative decoding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 12622–12642. [Link](http://dx.doi.org/10.18653/v1/2024.acl-long.681)
- [7] S. Fan, X. Jiang, X. Li, X. Meng, P. Han, S. Shang, A. Sun, Y. Wang, and Z. Wang (2024). Not all layers of LLMs are necessary during inference. arXiv preprint arXiv:2403.02181.
- [8] S. Ilager, L. F. Briem, and I. Brandic (2025). GREEN-CODE: optimizing energy efficiency in large language models for code generation. arXiv preprint [arXiv:2501.11006](https://arxiv.org/abs/2501.11006).
- [9] X. Jiang, W. Yu, D. Wang, and L. Hu (2026). Global evolutionary steering: refining activation steering control via cross-layer consistency. arXiv preprint [arXiv:2603.12298](https://arxiv.org/abs/2603.12298).
- [10] X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu (2020). TinyBERT: distilling BERT for natural language understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 4163–4174.
- [11] D. Khanna, A. K. Guru, S. Sridhar, Z. Ahmed, R. Bahirwani, M. Malhotra, V. Jain, A. Chadha, A. Das, and K. Ghosh (2025). QuickSilver – speeding up LLM inference through dynamic token halting, KV skipping, contextual token fusion, and adaptive matryoshka quantization. arXiv preprint [arXiv:2506.22396](https://arxiv.org/abs/2506.22396).
- [12] A. Kusupati, G. Bhatt, A. Rege, M. Wallingford, A. Sinha, V. Ramanujan, W. Howard-Snyder, K. Chen, S. Kakade, P. Jain, and A. Farhadi (2022). Matryoshka representation learning. In Advances in Neural Information Processing Systems, Vol. 35, pp. 30233–30249.
- [13] S. Laskaridis, A. Kouris, and N. D. Lane (2021). Adaptive inference through early-exit networks: design, challenges and directions. In Proceedings of the 5th International Workshop on Embedded and Mobile Deep Learning, pp. 1–6.
- [14] X. Li, C. Lou, Y. Chen, Z. Zhu, Y. Shen, Y. Ma, and A. Zou (2023). Predictive exit: prediction of fine-grained early exits for computation- and energy-efficient inference. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37, pp. 8657–8665.
- [15] F. Liu, Y. Tang, Z. Liu, Y. Ni, K. Han, and Y. Wang (2024). Kangaroo: lossless self-speculative decoding via double early exiting. arXiv preprint arXiv:2404.18911.
- [16] K. Liu, Y. Liu, X. Yang, P. Wang, W. Zhang, S. Feng, Y. Zhang, and D. Wang (2026). NEAT: neuron-based early exit for large reasoning models. arXiv preprint arXiv:2602.02010.
- [17] W. Liu, P. Zhou, Z. Zhao, Z. Wang, H. Deng, and Q. Ju (2020). FastBERT: a self-distilling BERT with adaptive inference time. arXiv preprint arXiv:2004.02178.
- [18] X. Ma, G. Fang, and X. Wang (2023). LLM-Pruner: on the structural pruning of large language models. Advances in Neural Information Processing Systems 36, pp. 21702–21720.
- [19] H. Rahmath P, V. Srivastava, K. Chaurasia, R. G. Pacheco, and R. S. Couto (2024). Early-exit deep neural network: a comprehensive survey. ACM Computing Surveys 57(3), pp. 1–37.
- [20] T. Schuster, A. Fisch, J. Gupta, M. Dehghani, D. Bahri, V. Tran, Y. Tay, and D. Metzler (2022). Confident adaptive language modeling. Advances in Neural Information Processing Systems 35, pp. 17456–17472.
- [21] T. Schuster, A. Fisch, T. Jaakkola, and R. Barzilay (2021). Consistent accelerated inference via confident adaptive transformers. arXiv preprint arXiv:2104.08803.
- [22] R. Schwartz, G. Stanovsky, S. Swayamdipta, J. Dodge, and N. A. Smith (2020). The right tool for the job: matching model and instance complexities. arXiv preprint arXiv:2004.07453.
- [23] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean (2017). Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.
- [24] S. Teerapittayanon, B. McDanel, and H. Kung (2016). BranchyNet: fast inference via early exiting from deep neural networks. In 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 2464–2469.
- [25] J. Xin, R. Tang, J. Lee, Y. Yu, and J. Lin (2020). DeeBERT: dynamic early exiting for accelerating BERT inference. arXiv preprint arXiv:2004.12993.
- [26] C. Yang, Q. Si, Y. Duan, Z. Zhu, C. Zhu, Q. Li, M. Chen, Z. Lin, and W. Wang (2025). Dynamic early exit in reasoning models. arXiv preprint [arXiv:2504.15895](https://arxiv.org/abs/2504.15895).
- [27] Z. Zeng, Y. Hong, H. Dai, H. Zhuang, and C. Chen (2024). ConsistentEE: a consistent and hardness-guided early exiting method for accelerating language models inference. arXiv preprint [arXiv:2312.11882](https://arxiv.org/abs/2312.11882).
- [28] W. Zhou, C. Xu, T. Ge, J. McAuley, K. Xu, and F. Wei (2020). BERT loses patience: fast and robust inference with early exit. Advances in Neural Information Processing Systems 33, pp. 18330–18341.
- \[29\]W\. Zhu\(2021\)LeeBERT: learned early exit for BERT with cross\-level optimization\.InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\),pp\. 2968–2980\.Cited by:[§4](https://arxiv.org/html/2604.18592#S4.p1.1)\.

## Appendix A Splitting the text into sentences

The datasets contain one user contribution per sample, such as a product review, which must be split into individual sentences. Because of the informal style of user reviews, naive splitting on punctuation is insufficient, so our solution employs two regular expressions. The first is a negative look-behind that prevents splits at positions that are not genuine sentence boundaries; specifically, it excludes periods in abbreviations (e.g. "Dr." or "etc.") and runs of consecutive periods ("..."). The second actively identifies true delimiters: a period followed by whitespace, or an exclamation mark or question mark (potentially repeated) followed by whitespace. In practice, consider the review "Excellent product! It is great. I recommend Dr. Smith." The algorithm correctly identifies sentence boundaries after "product!" and "great." but preserves "I recommend Dr. Smith." as a unit, since the abbreviation "Dr." triggers the negative look-behind.
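To make the procedure concrete, the following Python sketch implements a splitter of this kind; the exact patterns and the abbreviation list are illustrative assumptions, not the authors' code.

```python
import re

# Sketch of the two-regex splitter described above; patterns and the
# abbreviation list are our assumptions, not the paper's implementation.
# Python's `re` requires fixed-width look-behinds, so one guard is
# generated per abbreviation instead of a single alternation.
ABBREVIATIONS = ["Dr.", "Mr.", "Mrs.", "Ms.", "etc.", "e.g.", "i.e."]
GUARDS = "".join(f"(?<!{re.escape(a)})" for a in ABBREVIATIONS)

# Split at whitespace that follows a genuine delimiter:
#   - a period, unless it extends a run of periods ("...") or completes
#     a known abbreviation, or
#   - one or more '!'/'?' (matching after the last one handles repeats).
SPLITTER = re.compile(GUARDS + r"(?<!\.\.)(?<=\.)\s+|(?<=[!?])\s+")

def split_sentences(text: str) -> list[str]:
    return [part for part in SPLITTER.split(text) if part]

print(split_sentences("Excellent product! It is great. I recommend Dr. Smith."))
# -> ['Excellent product!', 'It is great.', 'I recommend Dr. Smith.']
```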

## Appendix B Additional explanatory results

Figure [5](https://arxiv.org/html/2604.18592#A2.F5) shows cosine distances between embeddings of consecutive layers of the Llama-3.1-8B model, in both the vanilla and the fine-tuned version, on the MMS dataset testing split. The local peaks visible between layers 13–15 correspond to the optimal exit points at a 2% accuracy loss in Tables [2](https://arxiv.org/html/2604.18592#S5.T2) and [3](https://arxiv.org/html/2604.18592#S5.T3).
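The quantity plotted in Figure 5 is straightforward to reproduce. The PyTorch sketch below is our reconstruction for a Hugging Face model, not the authors' script; in particular, using the last token's hidden state at each layer is an assumption.

```python
import torch
import torch.nn.functional as F

def consecutive_layer_distances(model, tokenizer, texts):
    """Average cosine distance between consecutive layers' last-token
    hidden states (our reconstruction of the Figure 5 quantity)."""
    total = None
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        # out.hidden_states: (num_layers + 1) tensors of shape [1, seq, dim];
        # take the last token's embedding at every layer.
        h = torch.stack([hs[0, -1] for hs in out.hidden_states])  # [L+1, dim]
        dist = 1.0 - F.cosine_similarity(h[:-1], h[1:], dim=-1)   # [L]
        total = dist if total is None else total + dist
    return total / len(texts)
```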

Figure [6](https://arxiv.org/html/2604.18592#A2.F6) displays the accuracy/speed-up heatmap of the 2D early exit in the vanilla Llama-3.1-8B model on the MMS dataset testing split. The axes are the hyperparameters $\tau_{acc}$ (accuracy threshold) and $\tau_{ignore}$ (ignore threshold). The heatmap shows a well-behaved, broad global accuracy maximum, suggesting fast convergence of various hyperparameter search methods. The best speed-up at a 2% accuracy loss is found at $\tau_{acc}=27.74$, $\tau_{ignore}=0.4$.
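Because the maximum is broad, even a coarse grid over the two thresholds locates a near-optimal configuration quickly. A minimal sketch follows, assuming a hypothetical `evaluate(tau_acc, tau_ignore)` callback that runs the 2D early-exit pipeline on a validation split and returns (accuracy, speed-up); the search ranges are our choice, picked only to bracket the optimum reported above.

```python
import numpy as np

def sweep_thresholds(evaluate, baseline_acc, max_acc_loss=0.02):
    """Return the fastest (tau_acc, tau_ignore) whose accuracy stays
    within `max_acc_loss` of the full-model baseline."""
    best_params, best_speedup = None, 0.0
    for tau_acc in np.linspace(1.0, 40.0, 40):        # assumed range
        for tau_ignore in np.linspace(0.0, 1.0, 11):  # assumed range
            acc, speedup = evaluate(tau_acc, tau_ignore)
            # Keep the fastest setting within the accuracy budget.
            if acc >= baseline_acc - max_acc_loss and speedup > best_speedup:
                best_params, best_speedup = (tau_acc, tau_ignore), speedup
    return best_params, best_speedup
```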

![Refer to caption](https://arxiv.org/html/2604.18592v1/x3.png)

Figure 5: Cosine distances of embedding vectors between consecutive layers of Llama-3.1-8B, averaged over the MMS testing split.

![Refer to caption](https://arxiv.org/html/2604.18592v1/x4.png)

Figure 6: Accuracy and speed-up heatmap of the 2D early exit strategy for Llama-3.1-8B on the MMS dataset, no fine-tuning.
