RKSC: Reasoning-Aware KV Cache Sharing and Confident Early Exit for Multi-Step LLM Inference
Summary
Introduces RKSC, a training-free inference framework for multi-branch LLM reasoning that reduces KV cache redundancy via similarity-based sharing and early exit, achieving a mean 3.0× speedup over a No-KV baseline (peak 3.99×) with a CGEE-induced error rate of 0.37%.
# Reasoning-Aware KV Cache Sharing and Confident Early Exit for Multi-Step LLM Inference
Source: [https://arxiv.org/html/2606.09937](https://arxiv.org/html/2606.09937)
###### Abstract
We introduce RKSC (Reasoning-Aware KV Cache Sharing), a training-free inference framework that eliminates two structural redundancies in multi-branch LLM reasoning pipelines. ASKS (Attention-Similarity KV Sharing) computes the prefix KV cache once and broadcasts it to all semantically similar branches via hidden-state cosine similarity, strictly generalising the token-exact prefix caching used by vLLM and SGLang. CGEE (Confidence-Gated Early Exit) applies two complementary exit mechanisms: (1) it skips the verification forward pass entirely when generation confidence is decisive across branches, and (2) it terminates the verification pass at an intermediate layer when per-layer entropy stabilises, using lightweight hooks on the transformer backbone. RSBCM (Reasoning-Selective Block Cache Manager) prevents unbounded cache growth via attention-weighted depth-priority eviction. Across five model families (7B–10B), four benchmarks, and 1,000 evaluated problems, RKSC achieves a mean speedup of 3.008× over the No-KV baseline (peak 3.990×), a 1.66× mean improvement over vLLM-equivalent prefix caching, with a CGEE-induced error rate of only 0.37% (6 errors out of 1,616 verify calls). No fine-tuning or architecture changes are required. Code is available at [https://github.com/AnirudhSekar/RKSC](https://github.com/AnirudhSekar/RKSC).
Machine Learning, ICML, KV Cache, Inference Optimization, Reasoning
## 1 Introduction
Inference-time scaling systems such as DeepSeek-R1 (Guo et al., [2025](https://arxiv.org/html/2606.09937#bib.bib9)), Qwen3 (An Yang et al., [2025](https://arxiv.org/html/2606.09937#bib.bib2)), and o1-style models generate multiple reasoning trajectories per problem, with the first trajectory correct in the majority of cases (Chen et al., [2025b](https://arxiv.org/html/2606.09937#bib.bib5)). This multi-branch paradigm incurs two structural redundancies that no production system addresses. (1) Cross-branch KV waste: parallel branches share the majority of their prefix KV computation yet recompute it independently. In a Tree of Thoughts (Yao et al., [2023](https://arxiv.org/html/2606.09937#bib.bib26)) run (branching 4, depth 3), branches diverging at depth 2 share 66% of KV work by construction. vLLM (Kwon et al., [2023](https://arxiv.org/html/2606.09937#bib.bib13)) and SGLang (Zheng et al., [2024](https://arxiv.org/html/2606.09937#bib.bib27)) reuse KV blocks on exact token matches only, leaving all cross-branch similarity unexploited. (2) Over-deep verification: process-reward verification (Lightman et al., [2023](https://arxiv.org/html/2606.09937#bib.bib15)) runs full-depth forward passes even when the model is already confident; no system exploits per-layer entropy collapse to exit early *within* a single verify pass.
We introduce RKSC (Reasoning-Aware KV Cache Sharing), a training-free inference framework eliminating both redundancies. Our contributions are:
- ASKS (§[2.2](https://arxiv.org/html/2606.09937#S2.SS2)): gates prefix KV reuse across branches via *hidden-state cosine similarity* with exponentially weighted later-layer emphasis, strictly generalising token-exact caching. Achieves 28.6% additional reuse where token-exact matching yields 0%.
- CGEE (§[2.3](https://arxiv.org/html/2606.09937#S2.SS3)): dual-level exit combining (a) *verify-skip* when generation confidence is decisive, and (b) *layer-level entropy exit* terminating the verify pass when per-layer logit entropy stabilises.
KV caching (Vaswani et al., [2023](https://arxiv.org/html/2606.09937#bib.bib21)) and block-level prefix reuse (vLLM (Kwon et al., [2023](https://arxiv.org/html/2606.09937#bib.bib13)), SGLang (Zheng et al., [2024](https://arxiv.org/html/2606.09937#bib.bib27))) require byte-identical token prefixes; ASKS decouples reuse from lexical identity. MemShare (Chen et al., [2025a](https://arxiv.org/html/2606.09937#bib.bib4)) deduplicates KV states *within* a single chain via step-delimiter segmentation; ASKS operates *across* parallel branches. Layer-wise early exit has been applied to BERT (Xin et al., [2020](https://arxiv.org/html/2606.09937#bib.bib23)) and generation (Schuster et al., [2022](https://arxiv.org/html/2606.09937#bib.bib18); Elhoushi et al., [2024](https://arxiv.org/html/2606.09937#bib.bib8)); for reasoning, Yang et al. ([2025](https://arxiv.org/html/2606.09937#bib.bib25)) exit *between* chains at transition tokens, a token-level mechanism orthogonal to CGEE’s within-pass layer-level exit. Self-consistency (Wang et al., [2023](https://arxiv.org/html/2606.09937#bib.bib22)) and process-reward methods (Lightman et al., [2023](https://arxiv.org/html/2606.09937#bib.bib15)) motivate multi-branch generation; RKSC accelerates any such pipeline as a drop-in wrapper, without altering the reasoning algorithm.
A complementary line of inference acceleration uses a small draft model to propose candidate tokens, which a larger target model then verifies in parallel. Leviathan et al. ([2023](https://arxiv.org/html/2606.09937#bib.bib14)) establish that this approach preserves the output distribution exactly while achieving 2–3× wall-clock speedup, and SpecInfer (Miao et al., [2024](https://arxiv.org/html/2606.09937#bib.bib16)) generalises it to tree-structured candidate sets verified in a single batched forward pass. Medusa (Cai et al., [2024](https://arxiv.org/html/2606.09937#bib.bib3)) removes the need for a separate draft model entirely by appending lightweight decoding heads to the backbone, predicting multiple future tokens in parallel and verifying them via tree attention. RKSC’s CGEE is complementary to all three: whereas speculative methods accelerate the decode phase by parallelising token proposals, CGEE accelerates the verification phase of process-reward scoring by gating or truncating the verify forward pass based on per-layer entropy collapse.
## 2 RKSC Pipeline
RKSC accelerates multi-branch reasoning inference through three complementary mechanisms: (1) KV prefix sharing eliminates redundant prefill computation across branches, (2) CGEE reduces or eliminates the verification forward pass, and (3) RSBCM manages cache capacity under deep tree search. Figure [1](https://arxiv.org/html/2606.09937#S2.F1) illustrates the full pipeline. The design exploits structure that already exists in the inference workload (shared prefixes, concentrated generation confidence, and entropy collapse in later layers) rather than introducing new computation. Each mechanism is independently switchable and adds zero learnable parameters.
[Figure 1 diagram: Input Problem → Root Forward ($\mathcal{C}, \mathbf{H}_{\text{root}}$) → ASKS Gate ($\sigma_b \geq \tau$?) → branches $B_1, B_2, \ldots, B_N$ with shared $\mathcal{C}_B$ → Gen. Confidence $p^{(b)}$ → CGEE (skip | early-exit | full): Full Verify, Early-Exit Verify at layer $l^*$, or Verify Skip → Final Answer]

Figure 1: RKSC pipeline overview. A single root forward computes the shared KV cache $\mathcal{C}$ and root hidden states. ASKS broadcasts $\mathcal{C}$ only to branches with similarity $\sigma_b \geq \tau$. All $B$ branches decode in one batched call while accumulating generation confidence. CGEE selects among three paths: *verify skip* (confidence decisive), *early-exit verify* (entropy stabilised at $l^*$), or *full verify*.

### 2.1 Problem Setting
We consider an LLM $f_\theta$ with $L$ transformer layers solving a problem $x$ by generating $B$ reasoning branches from a shared problem prefix. Each branch $b \in \{1, \ldots, B\}$ begins from a shared context $\mathbf{c} = \texttt{prefix}(x)$ of $n$ tokens and appends a short, branch-specific suffix (a reasoning hint of $s \ll n$ tokens) before decoding $t$ answer tokens autoregressively. A *verify* pass then scores all $B$ branches and selects the best answer. The standard inference procedure, which we call the *No-KV* baseline, batches all $B$ branches in a single forward pass but recomputes the full $n$-token prefix attention independently for each branch, paying $O(Bn^2)$ attention cost. A stronger baseline, the *vLLM-equivalent*, performs token-exact prefix matching: the root KV cache is reused only when branch tokens share a byte-identical prefix with the root, mirroring the PagedAttention prefix-cache path in vLLM (Kwon et al., [2023](https://arxiv.org/html/2606.09937#bib.bib13)) and SGLang (Zheng et al., [2024](https://arxiv.org/html/2606.09937#bib.bib27)). This baseline captures all gains that existing production systems achieve and serves as our primary comparison point.
Algorithm [1](https://arxiv.org/html/2606.09937#alg1) presents the top-level RKSC solve procedure. Detailed per-mechanism pseudocode appears in Appendix [B](https://arxiv.org/html/2606.09937#A2).

Algorithm 1: RKSC top-level solve

Require: prefix $\mathbf{c}$, suffixes $\{\mathbf{s}_b\}_{b=1}^{B}$, thresholds $(\tau, \tau_{\text{conf}}, r_{\text{gap}}, \theta, \epsilon)$

1. $\mathcal{C}, \mathbf{H}_{\text{root}} \leftarrow f_\theta^{\text{prefill}}(\mathbf{c})$ {single root forward (§[2.2](https://arxiv.org/html/2606.09937#S2.SS2))}
2. $\text{sharedSet} \leftarrow \textsc{AsksGate}(\mathbf{H}_{\text{root}}, \{\mathbf{s}_b\}, \tau)$ {Alg. [2](https://arxiv.org/html/2606.09937#alg2)}
3. $\mathcal{C}_B \leftarrow \texttt{repeat\_interleave}(\mathcal{C}, B)$; $\mathcal{C}_B.\texttt{clone()}$
4. $\{\mathbf{y}_b\}, \{p^{(b)}\} \leftarrow \textsc{BatchedDecode}(\{\mathbf{s}_b\}, \mathcal{C}_B)$ {track top-1 probs}
5. if $\max_b p^{(b)} \geq \tau_{\text{conf}}$ and gap-ratio $\geq r_{\text{gap}}$ then
6. return $\arg\max_b p^{(b)}$ {verify-skip, Level 1}
7. end if
8. return $\textsc{VerifyWithEntropyExit}(\{\mathbf{y}_b\}, \theta, \epsilon)$ {Alg. [3](https://arxiv.org/html/2606.09937#alg3), Level 2}
### 2.2 ASKS: Attention-Similarity KV Sharing
#### Prefix forward pass.
RKSC begins each solve call with a single prefix forward pass through all $L$ layers, storing the resulting KV states and hidden representations:

$$\mathbf{K}^{(l)}, \mathbf{V}^{(l)}, \mathbf{h}^{(l)}_{\text{root}} = f_\theta^{(\leq l)}(\mathbf{c}), \quad l = 1, \ldots, L, \tag{1}$$

yielding the cache $\mathcal{C} = \{(\mathbf{K}^{(l)}, \mathbf{V}^{(l)})\}_{l=1}^{L}$ and root hidden states $\mathbf{H}_{\text{root}} = \{\mathbf{h}^{(l)}_{\text{root}}\}$. To avoid redundant computation during the $B$ subsequent branch comparisons, root hidden states are pre-normalised to unit vectors once and stored in this normalised form.
#### Exponentially weighted similarity gating\.
After each branch $b$ processes its suffix tokens, ASKS computes a weighted cosine similarity between the branch’s hidden states and the stored root representations. The weighting schedule places exponentially increasing emphasis on later layers, which carry more task-specific information and are therefore more discriminative for detecting genuine semantic divergence:

$$\sigma_b = \sum_{l=1}^{L} w_l \cdot \frac{\mathbf{h}^{(l)}_b \cdot \mathbf{h}^{(l)}_{\text{root}}}{\|\mathbf{h}^{(l)}_b\| \, \|\mathbf{h}^{(l)}_{\text{root}}\|}, \tag{2}$$

where $w_l = \exp(\alpha l / L) \big/ \sum_{l'} \exp(\alpha l' / L)$ with $\alpha = 1.5$. We select $\alpha = 1.5$ by grid search over $\alpha \in \{0.5, 1.0, 1.5, 2.0\}$ on 30 held-out GPQA Diamond problems; larger values over-discount early layers and reduce sensitivity to prefix divergence, while smaller values under-weight task-specific later layers.
Branches with $\sigma_b \geq \tau$ are deemed semantically consistent with the prefix and receive the shared cache; branches below $\tau$ fall back to independent recomputation. This design strictly generalises token-exact prefix caching: any lexically identical prefix trivially achieves $\sigma_b = 1$, so every branch that vLLM or SGLang would reuse is also reused by ASKS. The converse does not hold: ASKS additionally reuses the cache for branches whose token sequences differ but whose hidden-state representations remain close to the root. In our diverse-phrasing stress test (§[4.1](https://arxiv.org/html/2606.09937#S4.SS1)), this generalisation yields 28.6% additional reuse beyond token-exact matching.
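The gate in Eq. (2) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the toy two-dimensional hidden states, and the threshold `tau = 0.8` are assumptions made for illustration only, and a real system would operate on tensors with pre-normalised root states.

```python
import math

def layer_weights(num_layers, alpha=1.5):
    """w_l proportional to exp(alpha * l / L) for l = 1..L, normalised to sum to 1."""
    raw = [math.exp(alpha * l / num_layers) for l in range(1, num_layers + 1)]
    total = sum(raw)
    return [r / total for r in raw]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def asks_similarity(branch_states, root_states, alpha=1.5):
    """Eq. (2): layer-weighted sum of per-layer cosine similarities."""
    weights = layer_weights(len(root_states), alpha)
    return sum(w * cosine(hb, hr)
               for w, hb, hr in zip(weights, branch_states, root_states))

# Toy per-layer hidden states for a 3-layer model (hypothetical values).
root = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
early_divergence = [[0.0, 1.0], [0.0, 1.0], [1.0, 1.0]]   # differs at layer 1 only
late_divergence = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]   # differs at layer 3 only

tau = 0.8  # illustrative threshold, not a value taken from the paper
sigma_early = asks_similarity(early_divergence, root)  # roughly 0.81
sigma_late = asks_similarity(late_divergence, root)    # roughly 0.49
share_early = sigma_early >= tau   # passes the gate, receives the shared cache
share_late = sigma_late >= tau     # fails the gate, falls back to recomputation
```

Because later layers carry larger weights, an orthogonal hidden state at the last layer lowers the score far more than the same divergence at the first layer, which is the behaviour the weighting schedule is meant to produce.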
#### KV broadcast and batched decode\.
For branches that pass the similarity gate, the shared cache is expanded to batch size $B$ via a zero-copy `repeat_interleave` on each layer’s key and value tensors: $\mathcal{C}_B = \texttt{repeat\_interleave}(\mathcal{C}, B, \text{dim}=0)$. An explicit `.clone()` call ensures contiguous memory layout, preventing potential SDPA kernel failures with non-contiguous inputs. All $B$ branch suffixes are then stacked into a single batched tensor and decoded in one forward pass conditioned on $\mathcal{C}_B$: $\mathbf{y}_b = f_\theta(\mathbf{s}_b \mid \mathcal{C}_B)$, $b = 1, \ldots, B$, eliminating the $O(n^2)$ attention cost for $B - 1$ branches.
#### KV probe\.
KV prefix sharing is not unconditionally beneficial: at short prefixes, the cost of the prefix forward pass plus the `repeat_interleave` expansion can exceed the savings. Rather than relying on the analytical threshold $n^*$ from Proposition [A.1](https://arxiv.org/html/2606.09937#A1.Thmhheorem1) (which requires accurate per-platform estimates of the cost constants), RKSC includes a self-calibrating runtime probe. At the first solve call within each 64-token prefix-length bucket $k = \lfloor n/64 \rfloor$, the probe times three paths, taking the minimum of 3 repetitions for each: (i) batched full recompute, (ii) a single prefix forward, and (iii) batched suffix decode with the shared cache. KV sharing is declared beneficial iff the sum of paths (ii) and (iii) is less than path (i). On A100 with SDPA attention, the probe universally declares KV sharing beneficial at $n \geq 512$ for all five evaluated models.
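The probe logic described above might look roughly like the following sketch. The class name `KVProbe`, its method names, and the callables passed in are hypothetical; on a GPU each callable would also need to synchronise the device before returning so that wall-clock timings are meaningful.

```python
import time

def min_time(fn, repeats=3):
    """Wall-clock time of fn(), taken as the minimum over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

class KVProbe:
    """Caches one share/recompute decision per 64-token prefix-length bucket."""

    def __init__(self, bucket_width=64, repeats=3):
        self.bucket_width = bucket_width
        self.repeats = repeats
        self._decisions = {}

    def should_share(self, prefix_len, full_recompute, prefix_forward, suffix_decode):
        bucket = prefix_len // self.bucket_width   # k = floor(n / 64)
        if bucket not in self._decisions:
            t_full = min_time(full_recompute, self.repeats)          # path (i)
            t_shared = (min_time(prefix_forward, self.repeats)       # path (ii)
                        + min_time(suffix_decode, self.repeats))     # path (iii)
            # Sharing is beneficial iff (ii) + (iii) < (i).
            self._decisions[bucket] = t_shared < t_full
        return self._decisions[bucket]
```

Caching the decision per bucket means the timing cost is paid only on the first solve call in each prefix-length range; later calls in the same bucket reuse the stored boolean.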
### 2.3 CGEE: Confidence-Gated Early Exit
CGEE operates at two levels\. The first level decides whether to run the verification pass at all; the second level decides how deeply to run it when it proceeds\.
#### Generation confidence \(input to Level 1\)\.
During the stepwise decode loop, RKSC records the top-1 softmax probability at each decode step for each branch. The generation confidence of branch $b$ is the mean top-1 probability across all $t$ steps:

$$p^{(b)} = \frac{1}{t} \sum_{j=1}^{t} \max_v \Pr[y_j^{(b)} = v], \tag{3}$$

accumulated in-place at zero additional compute cost, since the softmax is already computed to select the next token at every step. A branch that emits each token with high probability is a strong candidate for the correct answer; branches with uniformly low top-1 probabilities indicate uncertainty.
#### Verify\-skip gate \(Level 1\)\.
After decoding completes, RKSC evaluates whether to skip the verification pass entirely\. The gate fires when two conditions are jointly satisfied: the winning branch has high absolute confidence, and it dominates all competing branches by a clear relative margin:

$$\max_b p^{(b)} \geq \tau_{\text{conf}} \;\; \text{and} \;\; \frac{\max_b p^{(b)} - \text{2nd-max}_b \, p^{(b)}}{\max_b p^{(b)}} \geq r_{\text{gap}}. \tag{4}$$

The dual condition prevents two failure modes: (a) a single branch with moderately high confidence but a close competitor (the absolute threshold alone would fire), and (b) a large relative gap between two branches that both have low absolute confidence (the gap condition alone would fire). When the gate fires, the entire $\delta$-cost verify pass is skipped and branches are ranked directly by $p^{(b)}$.
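Eqs. (3) and (4) translate into a few lines of code. The sketch below is illustrative only: the function names and the thresholds `tau_conf=0.8` and `r_gap=0.1` are assumptions (the paper calibrates thresholds per model), and it assumes at least two branches with non-zero confidence.

```python
def generation_confidence(top1_probs):
    """Eq. (3): mean top-1 probability over the decode steps of one branch."""
    return sum(top1_probs) / len(top1_probs)

def verify_skip(confidences, tau_conf, r_gap):
    """Eq. (4): return the winning branch index if the gate fires, else None."""
    ranked = sorted(range(len(confidences)),
                    key=lambda b: confidences[b], reverse=True)
    best = confidences[ranked[0]]
    runner_up = confidences[ranked[1]]
    gap_ratio = (best - runner_up) / best
    if best >= tau_conf and gap_ratio >= r_gap:
        return ranked[0]
    return None

# Hypothetical per-branch confidences illustrating the three outcomes.
decisive = verify_skip([0.92, 0.70, 0.65], tau_conf=0.8, r_gap=0.1)     # gate fires
close_rival = verify_skip([0.92, 0.90, 0.50], tau_conf=0.8, r_gap=0.1)  # failure mode (a)
all_unsure = verify_skip([0.40, 0.20, 0.10], tau_conf=0.8, r_gap=0.1)   # failure mode (b)
```

Only the first case skips verification; the other two fall through to the Level 2 verify pass, matching the two failure modes the dual condition is designed to block.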
#### Layer\-level entropy exit \(Level 2\)\.
When the verify-skip gate does not fire and the full verification forward pass begins, CGEE installs lightweight forward hooks on each transformer layer via `register_forward_hook`. At each layer $l$, the hook extracts the last-token hidden state, projects it through the cached unembedding matrix $W_u$ (stored once at model load time), and computes the logit-space entropy:

$$H^{(l)} = -\sum_v \text{softmax}(\mathbf{h}^{(l)} W_u^\top)_v \, \log \text{softmax}(\mathbf{h}^{(l)} W_u^\top)_v. \tag{5}$$

The verification pass exits early at layer $l^*$ when three conditions are jointly satisfied: (i) $l^* \geq l_{\min}$ (minimum exit depth, default 2, ensuring sufficient representation refinement); (ii) $H^{(l^*)} < \theta$ (entropy below a model-specific threshold, indicating a concentrated logit distribution); and (iii) $|H^{(l^*)} - H^{(l^*-1)}| < \epsilon$ (entropy has stabilised between consecutive layers, ruling out transient dips). When the exit triggers, the hook raises an internal exception that the solver catches, using the partial output as the verification result. Hooks are removed after each solve call to prevent memory leaks.
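The exit rule can be sketched independently of any model by feeding per-layer logits to a small helper. This is an assumption-laden illustration: `find_exit_layer` and the toy logits are hypothetical, and a real implementation would compute the logits inside forward hooks rather than receive them as a list.

```python
import math

def entropy(logits):
    """Shannon entropy (in nats) of softmax(logits), computed stably."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def find_exit_layer(per_layer_logits, theta, epsilon, l_min=2):
    """Return the first 1-indexed layer meeting conditions (i)-(iii), or None."""
    prev = None
    for layer, logits in enumerate(per_layer_logits, start=1):
        h = entropy(logits)
        if (layer >= l_min and prev is not None
                and h < theta and abs(h - prev) < epsilon):
            return layer
        prev = h
    return None

# Toy logits over a 4-token vocabulary: uniform, then sharpening, then stable.
toy_logits = [
    [0.0, 0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0, 0.0],
    [8.0, 0.0, 0.0, 0.0],
    [8.1, 0.0, 0.0, 0.0],
]
exit_layer = find_exit_layer(toy_logits, theta=0.5, epsilon=0.01)
```

In this toy run layer 3 already has low entropy but is rejected because the entropy is still dropping sharply from layer 2; layer 4 satisfies all three conditions, which shows how condition (iii) rules out transient dips.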
On Qwen2.5-7B (28 layers), the layer-level exit fires on 100% of verify passes at a mean exit layer of 18.4, so the logit distribution stabilises about two-thirds of the way through the stack and roughly the last 34% of layers are skipped. This supports the core premise of Level 2: the unembedding projection at intermediate layers contains sufficient information to determine the verification outcome, and the remaining 34% of layers are computationally wasteful for this purpose.
#### Calibration\.
The thresholds $\tau_{\text{conf}}$, $r_{\text{gap}}$, and $\theta$ are calibrated per model on 30 held-out GPQA Diamond problems (non-overlapping with evaluation). $\tau_{\text{conf}}$ is set to the 75th percentile of the observed $\max_b p^{(b)}$ distribution; $r_{\text{gap}}$ to the 25th percentile of the observed relative gap; and $\theta$ to the median of per-layer entropy values at the first stabilisation point. This yields conservative gates that fire only on clear-cut cases.
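A calibration step of this kind could be sketched as follows. The nearest-rank percentile rule is an assumption (the text does not say which interpolation is used), and `calibrate` together with its argument names is hypothetical.

```python
import math
import statistics

def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]; the rule itself is an assumption."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

def calibrate(max_confidences, relative_gaps, stabilisation_entropies):
    """Derive the three CGEE thresholds from held-out calibration statistics."""
    return {
        "tau_conf": percentile(max_confidences, 75),
        "r_gap": percentile(relative_gaps, 25),
        "theta": statistics.median(stabilisation_entropies),
    }

# Hypothetical calibration statistics from eight held-out problems.
thresholds = calibrate(
    max_confidences=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    relative_gaps=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    stabilisation_entropies=[1.0, 2.0, 3.0],
)
```

With these inputs the sketch picks the sixth-smallest confidence for `tau_conf`, the second-smallest gap for `r_gap`, and the middle entropy for `theta`.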
### 2.4 RSBCM: Reasoning-Selective Block Cache Manager
To prevent the shared KV cache from growing unboundedly under deep tree search, RKSC includes RSBCM. Each cached block receives an importance score $\omega = \text{branch score} / (\text{depth} + 1)$, which prioritises blocks from high-scoring branches at shallower depths. When the number of allocated blocks exceeds a configurable capacity (default 2,000 blocks), the manager evicts blocks in ascending $\omega$ order. The depth denominator ensures that blocks deep in the search tree, which are less likely to be revisited, are evicted first, while the branch-score numerator preserves blocks associated with higher-confidence reasoning trajectories.
In a stress test with $\text{max\_blocks} = 4 < B = 8$ (chosen to force evictions within a single solve call), RSBCM fires exactly $B - 4 = 4$ evictions per problem (80 total across 20 problems, matching the expected count exactly). The overhead is +1.5 ms/problem (95% CI [−6, +9] ms); with $N = 20$ the interval is wide, so the overhead is consistent with zero but not provably so, and the correct interpretation is that it is small relative to the speedups achieved. Answer agreement is 100% across all 20 problems. RSBCM is dormant in all single-depth experiments in this paper (total block count never exceeds the default threshold of 2,000) and is provided as a mechanism for multi-depth tree search. Full validation is in Appendix [H](https://arxiv.org/html/2606.09937#A8).
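The eviction policy can be illustrated with a small min-heap keyed on $\omega$. This is a sketch under stated assumptions rather than the paper's implementation: `BlockCacheManager`, `Block`, and the scores used below are hypothetical, and real KV blocks would hold tensors rather than bare ids.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Block:
    importance: float                     # omega = branch_score / (depth + 1)
    block_id: int = field(compare=False)  # excluded from ordering

class BlockCacheManager:
    """Evicts the lowest-omega blocks once capacity is exceeded."""

    def __init__(self, max_blocks=2000):
        self.max_blocks = max_blocks
        self._heap = []        # min-heap ordered by importance
        self.evictions = 0

    def add(self, block_id, branch_score, depth):
        omega = branch_score / (depth + 1)
        heapq.heappush(self._heap, Block(omega, block_id))
        evicted = []
        while len(self._heap) > self.max_blocks:
            evicted.append(heapq.heappop(self._heap).block_id)
            self.evictions += 1
        return evicted

    def resident_ids(self):
        return sorted(b.block_id for b in self._heap)

# Mirror of the stress-test shape: capacity 4, one block per branch, B = 8.
manager = BlockCacheManager(max_blocks=4)
for branch in range(8):
    manager.add(block_id=branch, branch_score=(branch + 1) / 10, depth=0)
```

With capacity 4 and 8 inserted blocks the manager performs 4 evictions and keeps the four highest-scoring blocks, matching the $B - 4 = 4$ count reported for the stress test.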
Full implementation details (DynamicCache, decode loop, TF32 configuration, memory release) are in Appendix [I](https://arxiv.org/html/2606.09937#A9).
## 3 Experimentation and Results
### 3.1 Experimental Setup
#### Hardware and software.
All experiments run on a single NVIDIA A100-80GB SXM4 GPU with bfloat16 precision, TF32 matrix multiplication enabled, and the SDPA attention backend. The software stack consists of PyTorch 2.3.1, Transformers 4.44.0, and CUDA 12.1. Random seeds are fixed across all experiments (`torch.manual_seed(42)`, `numpy.random.seed(42)`, `torch.cuda.manual_seed_all(42)`).
#### Models\.
We evaluate five publicly available GQA models spanning 7B–10B parameters and three distinct GQA compression ratios: Qwen2.5-7B-Instruct (Team, [2024a](https://arxiv.org/html/2606.09937#bib.bib19); Yang et al., [2024](https://arxiv.org/html/2606.09937#bib.bib24)) (28 layers, 28 query heads, 4 KV heads; 7:1 ratio), Mistral-7B-Instruct-v0.3 (Jiang et al., [2023](https://arxiv.org/html/2606.09937#bib.bib12)) (32L, 32Q, 8KV; 4:1), Falcon3-7B-Instruct (Team, [2024b](https://arxiv.org/html/2606.09937#bib.bib20)) (28L, 12Q, 4KV; 3:1), Falcon3-10B-Instruct (Team, [2024b](https://arxiv.org/html/2606.09937#bib.bib20)) (40L, 12Q, 4KV; 3:1), and Llama-3-8B-Instruct (AI@Meta, [2024](https://arxiv.org/html/2606.09937#bib.bib1)) (32L, 32Q, 8KV; 4:1). This selection covers three training lineages (Qwen, Mistral/Meta, TII), two vocabulary sizes (32K and 128K+), and a range of head configurations, ensuring that observed gains are not artifacts of a single architecture.
#### Baselines\.
We compare against two baselines. The *No-KV* baseline batches all $B$ branches in a single forward pass but recomputes the shared prefix independently for each branch, representing a well-engineered system without prefix caching. The *vLLM-equivalent* baseline implements token-exact prefix matching: it reuses the shared KV cache only for branches whose token sequences are byte-identical to the root prefix, mirroring the PagedAttention prefix-cache path in vLLM (Kwon et al., [2023](https://arxiv.org/html/2606.09937#bib.bib13)) and SGLang (Zheng et al., [2024](https://arxiv.org/html/2606.09937#bib.bib27)). In our identical-prefix experimental regime (where all branches share the same padded prefix), every branch is byte-identical to the root, so the vLLM-equivalent reuses the cache unconditionally, which is the same behaviour as setting $\tau = 0$ in the ASKS gate. Outside this regime (the diverse-phrasing ablation in §[4.1](https://arxiv.org/html/2606.09937#S4.SS1)), vLLM-equivalent achieves 0% reuse by construction. Both baselines are measured through the same solver code path as RKSC to ensure timing fairness.
#### Calibration and evaluation sets\.
The 30 GPQA Diamond problems used for per-model calibration (thresholds $\tau_{\text{conf}}$, $r_{\text{gap}}$, $\theta$) are drawn from a held-out split disjoint from all evaluation sets, including the 30 problems in Table [1](https://arxiv.org/html/2606.09937#S3.T1).
#### Branching and padding\.
All experiments use $B = 8$ branches with eight fixed reasoning-hint suffixes cycled across branches. Problems are padded to approximately 1,024 tokens using a neutral filler string prepended iteratively, placing all experiments in the KV-beneficial regime ($n/n^* \approx 11$, Proposition [A.1](https://arxiv.org/html/2606.09937#A1.Thmhheorem1)).
#### Datasets\.
We evaluate on four benchmarks spanning a difficulty gradient: GPQA Diamond (Rein et al., [2023](https://arxiv.org/html/2606.09937#bib.bib17)) (graduate-level science), MMLU-STEM (Hendrycks et al., [2021b](https://arxiv.org/html/2606.09937#bib.bib11), [a](https://arxiv.org/html/2606.09937#bib.bib10)) (university-level STEM), ARC-Challenge (Clark et al., [2018](https://arxiv.org/html/2606.09937#bib.bib6)) (medium science multiple-choice), and GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2606.09937#bib.bib7)) (elementary mathematics). Table [1](https://arxiv.org/html/2606.09937#S3.T1) uses $N = 30$ GPQA problems with $t = 32$ decode steps; the extended evaluation (Table [2](https://arxiv.org/html/2606.09937#S3.T2)) uses $N = 50$ per dataset with $t = 8$.
#### Timing protocol\.
Each condition is timed over $N_{\text{runs}} = 2$–3 repeats per problem; the reported latency is the mean across repeats and problems. Two warmup solves per condition are executed and discarded to stabilise the CUDA allocator before timing begins, reducing the coefficient of variation from ~30% to ~10%. `torch.cuda.synchronize()` is called before every wall-clock measurement.
#### Compute budget\.
All experiments together consume approximately 42 A100-GPU-hours: 3 hours for calibration (5 models × 30 held-out problems × 3 conditions), 24 hours for the extended evaluation (Table [2](https://arxiv.org/html/2606.09937#S3.T2)), 9 hours for ablations (§[4](https://arxiv.org/html/2606.09937#S4)), and 6 hours for the throughput decomposition at $t = 32$ (Table [1](https://arxiv.org/html/2606.09937#S3.T1)) and the verify-isolation study. Model weights are cached locally; no training occurs.
### 3.2 Throughput Decomposition
Table [1](https://arxiv.org/html/2606.09937#S3.T1) isolates the contribution of each RKSC component on GPQA Diamond using Qwen2.5-7B-Instruct with $t = 32$ decode steps. This longer decode regime makes the decode term $Bt\gamma$ non-negligible relative to the prefill term, providing a conservative estimate of gains.
Table 1: Latency decomposition on GPQA Diamond ($N = 30$, $B = 8$, prefix ≈ 1,024 tok, $t = 32$).

KV prefix sharing alone saves 500 ms per problem (25.5%), with a bootstrap 95% CI of [1.339, 1.347] on the corresponding speedup ratio and a coefficient of variation of 1%, confirming high measurement reliability. This saving is almost entirely attributable to prefix attention: at $n = 1{,}024$, the batched prefill of $B = 8$ branches dominates decode cost, and RKSC replaces $B - 1$ of those prefills with a single suffix forward conditioned on the shared cache.
The combined RKSC condition (KV+CGEE) achieves 1.622× ([1.486, 1.803]), saving 750 ms per problem. The elevated variance (CV = 28%) is expected and reflects the binary nature of the CGEE skip decision, which produces a bimodal latency distribution: problems on which the gate fires land on the fast skip-path, while the rest match the KV-only condition. This bimodality inflates the standard deviation relative to the mean and does not indicate measurement instability: the KV-sharing component, which accounts for the majority of the speedup, has CV = 1%. At $t = 32$ decode steps on GPQA Diamond (the hardest dataset), the verify-skip gate fires infrequently, so CGEE’s marginal contribution is modest in this decomposition. The full CGEE contribution is visible in the extended evaluation at $t = 8$ (§[3.3](https://arxiv.org/html/2606.09937#S3.SS3)), where verify cost constitutes a larger fraction of total latency.
#### Operating\-regime disclosure\.
The extended evaluation uses $t = 8$ decode steps, which is representative of the first branching phase in tree search and is a favourable operating point for RKSC because a short decode keeps the verify-pass cost proportionally large. The throughput decomposition at $t = 32$ (Table [1](https://arxiv.org/html/2606.09937#S3.T1)) and the verify-only isolation below provide complementary views at longer decode horizons where CGEE’s contribution is naturally smaller.
#### CGEE verify\-only isolation\.
To measure the layer-level entropy exit in isolation, we time the verification pass alone in a decode-dominated regime ($t = 128$, $N = 15$). Full-depth verification takes 315.3 ± 8.2 ms; CGEE-accelerated verification takes 250.7 ± 11.8 ms, yielding a verify-pass speedup of 1.258× (bootstrap 95% CI [1.225, 1.293], $p < 0.0001$). The layer-level entropy exit fires on 100% of branches at a mean exit layer of 18.4 out of 28, confirming that the model reaches a stable logit distribution well before the final layer on verification inputs. This validates the core premise of Level 2: the unembedding projection at intermediate layers contains sufficient information to determine the verification outcome.
### 3.3 Multi-Model, Multi-Dataset Evaluation
Table [2](https://arxiv.org/html/2606.09937#S3.T2) reports the full extended evaluation. At $t = 8$ decode steps (representative of the first branching phase in tree search, where decode cost is small relative to prefill), RKSC achieves a mean speedup of 3.008× across all 20 model–dataset pairs, with a peak of 3.990× (Llama-3-8B on MMLU-STEM).
Table 2: Extended evaluation ($B = 8$, prefix ≈ 1,024 tok, $t = 8$, $N = 50$ per dataset, 1,000 total problems). vLLM↑: vLLM-equiv vs No-KV. RKSC↑: KV+CGEE vs No-KV. CGEE skip: verify-skip gate fire rate.

| Model | Dataset | NoKV (ms) | vLLM (ms) | RKSC (ms) | vLLM↑ | RKSC↑ | CGEE skip |
|---|---|---|---|---|---|---|---|
| Qwen2.5-7B Instruct | GPQA | 1,159 | 653 | 598 | 1.78× | 1.94× | 3% |
| | GSM8K | 1,163 | 652 | 515 | 1.78× | 2.26× | 32% |
| | ARC-C | 1,163 | 651 | 391 | 1.79× | 2.97× | 71% |
| | MMLU | 1,160 | 649 | 371 | 1.79× | 3.13× | 70% |
| Mistral-7B Instruct | GPQA | 1,257 | 690 | 590 | 1.82× | 2.13× | 23% |
| | GSM8K | 1,262 | 689 | 581 | 1.83× | 2.17× | 30% |
| | ARC-C | 1,263 | 691 | 482 | 1.83× | 2.62× | 60% |
| | MMLU | 1,260 | 692 | 534 | 1.82× | 2.36× | 47% |
| Falcon3-7B Instruct | GPQA | 1,158 | 637 | 360 | 1.82× | 3.21× | 54% |
| | GSM8K | 1,170 | 650 | 376 | 1.80× | 3.11× | 50% |
| | ARC-C | 1,173 | 651 | 347 | 1.80× | 3.38× | 66% |
| | MMLU | 1,175 | 651 | 395 | 1.80× | 2.98× | 39% |
| Falcon3-10B Instruct | GPQA | 1,629 | 895 | 450 | 1.82× | 3.63× | 72% |
| | GSM8K | 1,634 | 898 | 473 | 1.82× | 3.46× | 57% |
| | ARC-C | 1,628 | 897 | 418 | 1.82× | 3.89× | 77% |
| | MMLU | 1,634 | 898 | 467 | 1.82× | 3.50× | 66% |
| Llama-3-8B Instruct | GPQA | 1,280 | 708 | 509 | 1.81× | 2.51× | 57% |
| | GSM8K | 1,283 | 710 | 464 | 1.81× | 2.76× | 65% |
| | ARC-C | 1,282 | 712 | 401 | 1.80× | 3.20× | 74% |
| | MMLU | 1,284 | 710 | 322 | 1.81× | 3.99× | 92% |
| Overall | | 1,301 | 719 | 452 | 1.808× | 3.008× | 55% |

#### Architecture generalization.
The vLLM-equivalent gain spans $1.78\times$–$1.83\times$ across all 20 model–dataset pairs (mean $1.808\times$, std $0.02$), confirming the prediction of Eq. ([8](https://arxiv.org/html/2606.09937#A1.E8)): since $S_{\text{KV}}$ depends on $n$ and $B$ but not on model family or task, fixing the prefix length and branching factor produces near-identical gains regardless of architecture. This near-constant behaviour persists despite substantial architectural variation: the five models span three GQA compression ratios (3:1, 4:1, 7:1), two vocabulary sizes, and three independent training lineages. The tight clustering ($\pm 0.02\times$) confirms that RKSC's KV throughput benefit transfers uniformly across model families without requiring a high-compression GQA architecture.
At the $n=1{,}024$ operating point, all branches achieve $\sigma_b \geq 0.95$ for all models, yielding $100\%$ KV reuse. In this identical-prefix regime, ASKS and the vLLM-equivalent baseline produce identical KV gains by construction; the genuine ASKS novelty over token-exact caching is isolated in §[4.1](https://arxiv.org/html/2606.09937#S4.SS1).
#### Effect of task difficulty on CGEE\.
CGEE raises the KV-only speedup from $1.81\times$ to $3.008\times$ on average, a $\mathbf{1.66\times}$ marginal improvement attributable entirely to the dual-level exit mechanism. The model-averaged verify-skip rate $\hat{\rho}$ tracks difficulty monotonically across the four datasets: it is lowest on GPQA Diamond (mean ${\sim}42\%$ across models, where graduate-level reasoning produces high inter-branch uncertainty) and highest on MMLU-STEM and ARC-Challenge (means of $63\%$ and $70\%$, with individual cells reaching $92\%$, where shorter problems allow the model to concentrate confidence rapidly). This difficulty-adaptive behaviour follows directly from the dual condition in Eq. ([4](https://arxiv.org/html/2606.09937#S2.E4)): harder problems produce smaller inter-branch gaps and lower absolute confidence, so both conditions are less likely to be satisfied simultaneously.
#### Effect of model scale\.
Falcon3-10B achieves the highest mean RKSC speedup ($3.619\times$), compared to per-model means of roughly $2.32\times$–$3.17\times$ for the 7B–8B models in Table 2 (Mistral-7B lowest, Falcon3-7B highest). The absolute latency saved also scales: 1,179 ms per problem for Falcon3-10B versus roughly 692–858 ms for the smaller models. This scaling follows from the latency model (Eq. ([6](https://arxiv.org/html/2606.09937#A1.E6))): larger models have higher per-layer costs, so a fixed number of layers avoided by prefix sharing and early exit represents a proportionally larger saving. Llama-3-8B achieves the highest single-cell result ($\mathbf{3.990\times}$ on MMLU-STEM with a 92% skip rate), demonstrating that models excluded from calibration benefit substantially from RKSC.
#### RKSC vs vLLM\-equivalent\.
The marginal speedup of RKSC over vLLM-equivalent prefix caching averages $+61.2\%$ across all 20 model–dataset pairs. In the identical-prefix regime, this gain is attributable entirely to CGEE's dual-level exit, since ASKS and vLLM-equivalent yield identical KV gains when all branches share the same token prefix. The genuine ASKS novelty is isolated in §[4.1](https://arxiv.org/html/2606.09937#S4.SS1).
### 3\.4Accuracy Analysis
Table 3: CGEE accuracy impact across all 20 model–dataset pairs (1,616 total verify calls).

Across 1,616 verify calls spanning all five models and four datasets, CGEE introduced only 6 errors ($0.37\%$), with a mean accuracy delta of $-0.2\%$ relative to full-depth verification (Table [3](https://arxiv.org/html/2606.09937#S3.T3)). Three model–dataset pairs account for all six errors: Mistral-7B on ARC-Challenge (3 errors, $-4.0\%$ $\Delta$ accuracy), Falcon3-10B on ARC-Challenge (2 errors, $-2.0\%$), and Llama-3-8B on GSM8K (1 error, $-2.0\%$). The remaining 17 model–dataset pairs exhibit zero CGEE-induced errors.
The error concentration in multiple\-choice benchmarks \(ARC\-C\) is consistent with the structure of the CGEE gate: on multiple\-choice items, generation confidence reflects the model’s fluency in producing the answer letter, which occasionally diverges from the correctness signal that full\-depth verification captures\. On free\-form generation tasks \(GPQA, GSM8K\), where generation confidence more directly reflects understanding, the error rate is lower\. These results confirm that the dual condition in Eq\. \([4](https://arxiv.org/html/2606.09937#S2.E4)\), requiring both high absolute confidence and a clear relative margin, is an effective safeguard against false positives\.
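The dual condition described above can be sketched as a small gate function. Eq. (4) is not reproduced in this section, so the parameter names `conf_min` and `margin_min` and their default values are hypothetical placeholders rather than the paper's calibrated thresholds; the sketch only illustrates that both conditions must hold before verification is skipped.

```python
def should_skip_verify(branch_conf, conf_min=0.9, margin_min=0.2):
    """Level-1 gate sketch: skip verification only if both conditions hold.

    branch_conf: per-branch generation confidences in [0, 1].
    conf_min, margin_min: hypothetical thresholds (not the paper's values).
    """
    if len(branch_conf) < 2:
        return False  # no runner-up branch to compare against
    ranked = sorted(branch_conf, reverse=True)
    top, runner_up = ranked[0], ranked[1]
    high_confidence = top >= conf_min                # absolute condition
    clear_margin = (top - runner_up) >= margin_min   # relative condition
    return high_confidence and clear_margin
```

A confident but contested result such as `[0.95, 0.90]` fails the margin test and is deferred to full verification, which is the false-positive case the dual condition is meant to guard against.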
## 4Analysis and Ablations
### 4\.1ASKS Novelty Over vLLM Prefix Caching
The experiments in §[3.3](https://arxiv.org/html/2606.09937#S3.SS3) use an identical-prefix regime where both ASKS and vLLM-equivalent caching achieve $100\%$ KV reuse and identical speedups. To isolate the contribution of semantic similarity gating, we construct a diverse-phrasing regime where branch-level token sequences differ from the first token.
We design eight phrasing templates at four semantic-divergence levels (identical, minor rewording, structural reframing, and very different academic register) and evaluate $N=15$ problems on Qwen2.5-7B. Token-exact caching achieves $0\%$ reuse on all non-identical phrasings by construction, since branches diverge at token position 1. ASKS achieves $28.6\%$ additional reuse at $\tau=0.82$, correctly identifying branches whose hidden-state representations remain semantically consistent despite surface-level lexical divergence.
The wall-clock implication is significant: on padded diverse phrasings ($n \approx 512$), ASKS achieves a $1.131\times$ speedup over the No-KV baseline (bootstrap 95% CI $[1.066, 1.174]$, $p=0.0007$), while token-exact caching provides no benefit. This result demonstrates that ASKS offers a strict superset of the functionality provided by existing prefix-caching systems, with measurable throughput gains in the regime where those systems fail.
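For readers who want to reproduce an interval of this kind, the following is a minimal sketch of a percentile bootstrap over per-problem latencies. The paper does not spell out its bootstrap procedure in this section, so the choices here (paired resampling of problems, ratio of summed latencies, percentile method, and the function name `bootstrap_speedup_ci`) are assumptions.

```python
import random

def bootstrap_speedup_ci(base_ms, fast_ms, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for sum(base) / sum(fast) over paired problems."""
    assert len(base_ms) == len(fast_ms) and base_ms
    rng = random.Random(seed)
    n = len(base_ms)
    ratios = []
    for _ in range(n_boot):
        # Resample problem indices with replacement, keeping pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        ratios.append(sum(base_ms[i] for i in idx) / sum(fast_ms[i] for i in idx))
    ratios.sort()
    lo = ratios[int((alpha / 2) * n_boot)]
    hi = ratios[int((1 - alpha / 2) * n_boot) - 1]
    point = sum(base_ms) / sum(fast_ms)
    return point, lo, hi
```

An interval whose lower bound stays above 1.0, as reported above, indicates that the speedup is unlikely to be a resampling artefact.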
### 4\.2Prefix\-Length Sensitivity
To characterise the operating envelope of KV sharing, we sweep the prefix length over $n \in \{128, 256, 512, 1024\}$ tokens with $B=8$ and $N=20$ problems (Qwen2.5-7B). At $n=128$, the speedup is $0.993\times$ ($[0.909, 1.103]$), confirming no net benefit at short prefixes where the `repeat_interleave` overhead approaches the saved prefill cost. At $n=256$ the result is similarly break-even ($0.999\times$, $[0.951, 1.061]$). At $n=512$, KV sharing is clearly beneficial ($1.127\times$, $[1.081, 1.224]$), and at $n=1{,}024$ the speedup reaches $1.452\times$ ($[1.301, 1.681]$).
The empirical transition from break-even to beneficial between $n=256$ and $n=512$ is higher than the analytical threshold $n^{*} \approx 90$ from Proposition [A.1](https://arxiv.org/html/2606.09937#A1.Thmhheorem1). This gap reflects constant-factor SDPA kernel overhead that the linear latency model does not capture: the actual prefix forward pass includes fixed-cost setup steps (kernel launch, memory allocation) that are non-negligible at short prefix lengths but amortise at $n \geq 512$. The KV probe (§[2.2](https://arxiv.org/html/2606.09937#S2.SS2)) handles this discrepancy correctly by measuring both paths at runtime rather than relying on the analytical threshold.
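The idea of measuring both paths at runtime can be illustrated with a small timing probe. The probe in §2.2 is not shown in this section, so the sketch below is an assumption about its general shape: `run_shared` and `run_recompute` are hypothetical zero-argument callables, and the median over a few trials is one reasonable choice among several.

```python
import statistics
import time

def probe_kv_sharing(run_shared, run_recompute, trials=3, timer=time.perf_counter):
    """Time both prefill strategies and report whether sharing is faster.

    run_shared / run_recompute: hypothetical callables that execute one
    prefill with and without the shared prefix cache.
    """
    def median_latency(fn):
        samples = []
        for _ in range(trials):
            start = timer()
            fn()
            samples.append(timer() - start)
        return statistics.median(samples)

    shared_s = median_latency(run_shared)
    recompute_s = median_latency(run_recompute)
    return shared_s < recompute_s, shared_s, recompute_s
```

On a GPU, each callable would also need to synchronise the device before returning so that the timer captures kernel execution rather than only the launch.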
### 4\.3ASKS Threshold Insensitivity
In the identical-prefix regime, all branches achieve $\sigma_b \geq 0.95$ for all five models, so the ASKS gate is non-binding regardless of threshold $\tau$. To confirm this, we sweep $\tau \in \{0.70, 0.82, 0.90, 0.95\}$ at $n \approx 512$ with $N=20$ problems: all four settings produce speedups of ${\sim}1.16\times$ ($p < 10^{-25}$), confirming that the gate decision is $\tau$-insensitive when the prefix is long relative to the branch suffix. This robustness is practically important: practitioners need not tune $\tau$ for their specific workload when operating at $n \geq 512$.
### 4\.4CGEE Skip Rate vs Task Difficulty
The CGEE verify-skip rate tracks task difficulty monotonically across the four benchmarks (averaged across all five models): GPQA Diamond $42\%$, GSM8K $47\%$, MMLU-STEM $63\%$, ARC-Challenge $70\%$. This ordering matches the expected difficulty gradient and follows directly from the gate design: on hard problems, the model distributes probability mass more uniformly across branches, causing both conditions in Eq. ([4](https://arxiv.org/html/2606.09937#S2.E4)) to fail more often. On easy problems, one branch dominates early in the decode sequence, both conditions are satisfied, and the gate fires. CGEE therefore provides its largest speedup contribution precisely on the tasks where verification is least necessary, a self-correcting property that limits accuracy risk.
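The per-benchmark averages quoted above can be recomputed from the CGEE skip column of Table 2. The snippet below is a small sanity check; the dictionary layout and the helper name `mean_skip_rates` are illustrative, while the numbers are copied from the table.

```python
# CGEE skip rates (%) from Table 2, ordered Qwen2.5-7B, Mistral-7B,
# Falcon3-7B, Falcon3-10B, Llama-3-8B.
SKIP_RATES = {
    "GPQA": [3, 23, 54, 72, 57],
    "GSM8K": [32, 30, 50, 57, 65],
    "ARC-C": [71, 60, 66, 77, 74],
    "MMLU": [70, 47, 39, 66, 92],
}

def mean_skip_rates(table):
    """Average each dataset's skip rate across the five models."""
    return {name: sum(vals) / len(vals) for name, vals in table.items()}

means = mean_skip_rates(SKIP_RATES)
ranking = sorted(means, key=means.get)  # lowest skip rate (hardest) first
```

The monotone ordering holds for the model-averaged rates; individual cells do not always follow it (for example, Falcon3-7B skips less often on MMLU than on GPQA).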
## 5Limitations
Sample size. Our evaluation uses $N=50$ problems per dataset, producing wide 95% confidence intervals on CGEE skip rates ($\pm 10$–$18\%$). Because interval width shrinks as $1/\sqrt{N}$, a sample of $N \geq 200$ would roughly halve these intervals (to about $\pm 5$–$9\%$). Importantly, the KV sharing gains (which account for the majority of the speedup) have $\text{CV} < 2\%$ and are not affected by this limitation.
Model scale. We evaluate 7B–10B models on a single A100 GPU. Models at 14B+ and multi-GPU tensor-parallel configurations introduce communication overhead that may alter the relative contribution of prefix sharing versus verification cost. The latency model (Eq. ([6](https://arxiv.org/html/2606.09937#A1.E6))) predicts larger models should benefit *more*, and the Falcon3-10B results ($3.62\times$ mean vs roughly $2.32$–$3.17\times$ for the 7B–8B models in Table 2) provide early evidence for this trend.
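The dependence of interval width on sample size can be checked with a short calculation. The paper does not state which interval estimator it uses, so the normal-approximation (Wald) interval below is an assumption, and the helper name `wald_half_width` is illustrative.

```python
import math

def wald_half_width(p, n, z=1.96):
    """Half-width of the normal-approximation 95% CI for a proportion p."""
    return z * math.sqrt(p * (1.0 - p) / n)

# Overall skip rate of 55% (Table 2) at the current and a larger sample size.
half_width_50 = wald_half_width(0.55, 50)
half_width_200 = wald_half_width(0.55, 200)
```

Under this assumed estimator, quadrupling the sample size halves the half-width; the exact figures depend on the estimator and on the skip rate of the dataset in question.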
Domain coverage. Our benchmarks cover science and mathematics. The CGEE entropy threshold $\theta$ is calibrated on GPQA Diamond and may not transfer to domains with substantially different entropy profiles. Per-dataset $\theta$ calibration via a small held-out sweep (5–10 problems) restores gate activity, at the cost of one additional calibration step.
CGEE error rate. While low ($0.37\%$, 6/1,616), CGEE is not error-free. All six errors occur in the verify-skip gate (Level 1); the layer-level entropy exit (Level 2) preserves the full verification outcome by construction. Applications requiring zero verification errors should disable Level 1 and retain only Level 2, accepting a smaller speedup contribution.
Scaling of ASKS broadcast. The `repeat_interleave` expansion is $O(B L d_k)$ in memory, where $d_k$ is the per-head KV dimension; for very large $B$ ($\geq 64$) on long prefixes, this expansion can itself become non-negligible on memory-bound workloads. A zero-copy view-based implementation (rather than the safer `.clone()`) would remove this cost at the expense of additional care around SDPA kernel input contiguity.
Hook-based exit is architecture-coupled. CGEE's Level 2 exit depends on `register_forward_hook` semantics and the existence of a distinct unembedding matrix accessible at model-load time. Models with tied input/output embeddings or non-standard layer modules may require minor adapter code. We verified compatibility with all five evaluated models; extending to MoE or fused-layer architectures is future work.
Concurrent solves. RSBCM's eviction policy assumes a single solve-call owner of the cache. In a multi-tenant inference server, concurrent solves would require a tenant-aware importance score; this is supported by the current design but not empirically validated here.
## 6Conclusion
We presented RKSC, a training-free inference acceleration framework for multi-branch LLM reasoning that addresses two structural redundancies in current inference pipelines. ASKS broadcasts a single shared prefix KV cache to all semantically consistent branches via exponentially weighted cosine similarity, generalising token-exact prefix caching to the semantic regime and demonstrating $28.6\%$ additional KV reuse on lexically diverse branches. CGEE reduces verification cost through a dual-level exit: a verify-skip gate that bypasses the entire verification pass when generation confidence is decisive, and a layer-level entropy exit that terminates the verification forward pass at an intermediate transformer layer when per-layer logit entropy stabilises. Together, these mechanisms achieve a mean speedup of $\mathbf{3.008\times}$ (peak $\mathbf{3.990\times}$) across five model families and four benchmarks, with a CGEE-induced error rate of $0.37\%$. The KV sharing component is architecture-agnostic (std $0.02\times$ across five models), while the CGEE component is difficulty-adaptive (model-averaged skip rates from $42\%$ on graduate-level science to $70\%$ on ARC-Challenge, with a single-cell peak of $92\%$ on MMLU-STEM). RKSC requires no fine-tuning, no architecture modifications, and no custom CUDA kernels.
## Impact Statement
This paper advances the efficiency of large language model inference for multi\-step reasoning tasks\. By reducing redundant computation in multi\-branch reasoning pipelines, RKSC lowers the energy cost of inference\-time scaling and makes state\-of\-the\-art reasoning more accessible on constrained hardware\. We are not aware of societal consequences specific to this work beyond those general to advancing the field of Machine Learning and Large Language Models \(LLMs\)\.
## References
- AI@Meta \(2024\)AI@Meta\.Llama 3 model card\.2024\.URL[https://github\.com/meta\-llama/llama3/blob/main/MODEL\_CARD\.md](https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md)\.
- An Yang et al\. \(2025\)Yang, A\., Li, A\., Yang, B\., Zhang, B\., Hui, B\., Zheng, B\., Yu, B\., Gao, C\., Huang, C\., Lv, C\., Zheng, C\., Liu, D\., Zhou, F\., Huang, F\., Hu, F\., Ge, H\., Wei, H\., Lin, H\., Tang, J\., Yang, J\., Tu, J\., Zhang, J\., Yang, J\., Yang, J\., Zhou, J\., Zhou, J\., Lin, J\., Dang, K\., Bao, K\., Yang, K\., Yu, L\., Deng, L\., Li, M\., Xue, M\., Li, M\., Zhang, P\., Wang, P\., Zhu, Q\., Men, R\., Gao, R\., Liu, S\., Luo, S\., Li, T\., Tang, T\., Yin, W\., Ren, X\., Wang, X\., Zhang, X\., Ren, X\., Fan, Y\., Su, Y\., Zhang, Y\., Zhang, Y\., Wan, Y\., Liu, Y\., Wang, Z\., Cui, Z\., Zhang, Z\., Zhou, Z\., and Qiu, Z\.Qwen3 technical report, 2025\.URL[https://arxiv\.org/abs/2505\.09388](https://arxiv.org/abs/2505.09388)\.
- Cai et al\. \(2024\)Cai, T\., Li, Y\., Geng, Z\., Peng, H\., Lee, J\. D\., Chen, D\., and Dao, T\.Medusa: Simple llm inference acceleration framework with multiple decoding heads, 2024\.URL[https://arxiv\.org/abs/2401\.10774](https://arxiv.org/abs/2401.10774)\.
- Chen et al\. \(2025a\)Chen, K\., Tan, X\., Yu, M\., and Xu, H\.Memshare: Memory efficient inference for large reasoning models through kv cache reuse, 2025a\.URL[https://arxiv\.org/abs/2507\.21433](https://arxiv.org/abs/2507.21433)\.
- Chen et al\. \(2025b\)Chen, X\., Xu, J\., Liang, T\., He, Z\., Pang, J\., Yu, D\., Song, L\., Liu, Q\., Zhou, M\., Zhang, Z\., Wang, R\., Tu, Z\., Mi, H\., and Yu, D\.Do not think that much for 2\+3=? on the overthinking of o1\-like llms, 2025b\.URL[https://arxiv\.org/abs/2412\.21187](https://arxiv.org/abs/2412.21187)\.
- Clark et al\. \(2018\)Clark, P\., Cowhey, I\., Etzioni, O\., Khot, T\., Sabharwal, A\., Schoenick, C\., and Tafjord, O\.Think you have solved question answering? try arc, the ai2 reasoning challenge\.*arXiv:1803\.05457v1*, 2018\.
- Cobbe et al\. \(2021\)Cobbe, K\., Kosaraju, V\., Bavarian, M\., Chen, M\., Jun, H\., Kaiser, L\., Plappert, M\., Tworek, J\., Hilton, J\., Nakano, R\., Hesse, C\., and Schulman, J\.Training verifiers to solve math word problems\.*arXiv preprint arXiv:2110\.14168*, 2021\.
- Elhoushi et al\. \(2024\)Elhoushi, M\., Shrivastava, A\., Liskovich, D\., Hosmer, B\., Wasti, B\., Lai, L\., Mahmoud, A\., Acun, B\., Agarwal, S\., Roman, A\., Aly, A\., Chen, B\., and Wu, C\.\-J\.Layerskip: Enabling early exit inference and self\-speculative decoding\.In*Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pp\. 12622–12642\. Association for Computational Linguistics, 2024\.doi:10\.18653/v1/2024\.acl\-long\.681\.URL[http://dx\.doi\.org/10\.18653/v1/2024\.acl\-long\.681](http://dx.doi.org/10.18653/v1/2024.acl-long.681)\.
- Guo et al\. \(2025\)Guo, D\. et al\.Deepseek\-r1 incentivizes reasoning in llms through reinforcement learning\.*Nature*, 645\(8081\):633–638, September 2025\.ISSN 1476\-4687\.doi:10\.1038/s41586\-025\-09422\-z\.URL[http://dx\.doi\.org/10\.1038/s41586\-025\-09422\-z](http://dx.doi.org/10.1038/s41586-025-09422-z)\.
- Hendrycks et al\. \(2021a\)Hendrycks, D\., Burns, C\., Basart, S\., Critch, A\., Li, J\., Song, D\., and Steinhardt, J\.Aligning ai with shared human values\.*Proceedings of the International Conference on Learning Representations \(ICLR\)*, 2021a\.
- Hendrycks et al\. \(2021b\)Hendrycks, D\., Burns, C\., Basart, S\., Zou, A\., Mazeika, M\., Song, D\., and Steinhardt, J\.Measuring massive multitask language understanding\.*Proceedings of the International Conference on Learning Representations \(ICLR\)*, 2021b\.
- Jiang et al\. \(2023\)Jiang, A\. Q\., Sablayrolles, A\., Mensch, A\., Bamford, C\., Chaplot, D\. S\., de las Casas, D\., Bressand, F\., Lengyel, G\., Lample, G\., Saulnier, L\., Lavaud, L\. R\., Lachaux, M\.\-A\., Stock, P\., Scao, T\. L\., Lavril, T\., Wang, T\., Lacroix, T\., and Sayed, W\. E\.Mistral 7b, 2023\.URL[https://arxiv\.org/abs/2310\.06825](https://arxiv.org/abs/2310.06825)\.
- Kwon et al\. \(2023\)Kwon, W\., Li, Z\., Zhuang, S\., Sheng, Y\., Zheng, L\., Yu, C\. H\., Gonzalez, J\. E\., Zhang, H\., and Stoica, I\.Efficient memory management for large language model serving with pagedattention, 2023\.URL[https://arxiv\.org/abs/2309\.06180](https://arxiv.org/abs/2309.06180)\.
- Leviathan et al\. \(2023\)Leviathan, Y\., Kalman, M\., and Matias, Y\.Fast inference from transformers via speculative decoding, 2023\.URL[https://arxiv\.org/abs/2211\.17192](https://arxiv.org/abs/2211.17192)\.
- Lightman et al\. \(2023\)Lightman, H\., Kosaraju, V\., Burda, Y\., Edwards, H\., Baker, B\., Lee, T\., Leike, J\., Schulman, J\., Sutskever, I\., and Cobbe, K\.Let’s verify step by step, 2023\.URL[https://arxiv\.org/abs/2305\.20050](https://arxiv.org/abs/2305.20050)\.
- Miao et al\. \(2024\)Miao, X\., Oliaro, G\., Zhang, Z\., Cheng, X\., Wang, Z\., Zhang, Z\., Wong, R\. Y\. Y\., Zhu, A\., Yang, L\., Shi, X\., Shi, C\., Chen, Z\., Arfeen, D\., Abhyankar, R\., and Jia, Z\.Specinfer: Accelerating large language model serving with tree\-based speculative inference and verification\.In*Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3*, ASPLOS ’24, pp\. 932–949\. ACM, April 2024\.doi:10\.1145/3620666\.3651335\.URL[http://dx\.doi\.org/10\.1145/3620666\.3651335](http://dx.doi.org/10.1145/3620666.3651335)\.
- Rein et al\. \(2023\)Rein, D\., Hou, B\. L\., Stickland, A\. C\., Petty, J\., Pang, R\. Y\., Dirani, J\., Michael, J\., and Bowman, S\. R\.Gpqa: A graduate\-level google\-proof q&a benchmark, 2023\.URL[https://arxiv\.org/abs/2311\.12022](https://arxiv.org/abs/2311.12022)\.
- Schuster et al\. \(2022\)Schuster, T\., Fisch, A\., Gupta, J\., Dehghani, M\., Bahri, D\., Tran, V\. Q\., Tay, Y\., and Metzler, D\.Confident adaptive language modeling, 2022\.URL[https://arxiv\.org/abs/2207\.07061](https://arxiv.org/abs/2207.07061)\.
- Team \(2024a\)Team, Q\.Qwen2\.5: A party of foundation models, September 2024a\.URL[https://qwenlm\.github\.io/blog/qwen2\.5/](https://qwenlm.github.io/blog/qwen2.5/)\.
- Team \(2024b\)Team, T\.The falcon 3 family of open models, December 2024b\.
- Vaswani et al\. \(2023\)Vaswani, A\., Shazeer, N\., Parmar, N\., Uszkoreit, J\., Jones, L\., Gomez, A\. N\., Kaiser, L\., and Polosukhin, I\.Attention is all you need, 2023\.URL[https://arxiv\.org/abs/1706\.03762](https://arxiv.org/abs/1706.03762)\.
- Wang et al\. \(2023\)Wang, X\., Wei, J\., Schuurmans, D\., Le, Q\., Chi, E\., Narang, S\., Chowdhery, A\., and Zhou, D\.Self\-consistency improves chain of thought reasoning in language models, 2023\.URL[https://arxiv\.org/abs/2203\.11171](https://arxiv.org/abs/2203.11171)\.
- Xin et al\. \(2020\)Xin, J\., Tang, R\., Lee, J\., Yu, Y\., and Lin, J\.Deebert: Dynamic early exiting for accelerating bert inference, 2020\.URL[https://arxiv\.org/abs/2004\.12993](https://arxiv.org/abs/2004.12993)\.
- Yang et al\. \(2024\)Yang, A\., Yang, B\., Hui, B\., Zheng, B\., Yu, B\., Zhou, C\., Li, C\., Li, C\., Liu, D\., Huang, F\., Dong, G\., Wei, H\., Lin, H\., Tang, J\., Wang, J\., Yang, J\., Tu, J\., Zhang, J\., Ma, J\., Xu, J\., Zhou, J\., Bai, J\., He, J\., Lin, J\., Dang, K\., Lu, K\., Chen, K\., Yang, K\., Li, M\., Xue, M\., Ni, N\., Zhang, P\., Wang, P\., Peng, R\., Men, R\., Gao, R\., Lin, R\., Wang, S\., Bai, S\., Tan, S\., Zhu, T\., Li, T\., Liu, T\., Ge, W\., Deng, X\., Zhou, X\., Ren, X\., Zhang, X\., Wei, X\., Ren, X\., Fan, Y\., Yao, Y\., Zhang, Y\., Wan, Y\., Chu, Y\., Liu, Y\., Cui, Z\., Zhang, Z\., and Fan, Z\.Qwen2 technical report\.*arXiv preprint arXiv:2407\.10671*, 2024\.
- Yang et al\. \(2025\)Yang, C\., Si, Q\., Duan, Y\., Zhu, Z\., Zhu, C\., Li, Q\., Chen, M\., Lin, Z\., and Wang, W\.Dynamic early exit in reasoning models, 2025\.URL[https://arxiv\.org/abs/2504\.15895](https://arxiv.org/abs/2504.15895)\.
- Yao et al\. \(2023\)Yao, S\., Yu, D\., Zhao, J\., Shafran, I\., Griffiths, T\. L\., Cao, Y\., and Narasimhan, K\.Tree of thoughts: Deliberate problem solving with large language models, 2023\.URL[https://arxiv\.org/abs/2305\.10601](https://arxiv.org/abs/2305.10601)\.
- Zheng et al\. \(2024\)Zheng, L\., Yin, L\., Xie, Z\., Sun, C\., Huang, J\., Yu, C\. H\., Cao, S\., Kozyrakis, C\., Stoica, I\., Gonzalez, J\. E\., Barrett, C\., and Sheng, Y\.Sglang: Efficient execution of structured language model programs, 2024\.URL[https://arxiv\.org/abs/2312\.07104](https://arxiv.org/abs/2312.07104)\.
The appendix is organised as follows\. Appendix[A](https://arxiv.org/html/2606.09937#A1)contains the full theoretical analysis: the latency model, KV sharing speedup and threshold proposition, the combined RKSC speedup corollary, and the dual\-level CGEE derivation\. Appendix[B](https://arxiv.org/html/2606.09937#A2)gives detailed pseudocode for the three mechanisms referenced in §[2](https://arxiv.org/html/2606.09937#S2)\. Appendix[C](https://arxiv.org/html/2606.09937#A3)collects all hyperparameters and the full prefix\-length sensitivity sweep\. Appendix[D](https://arxiv.org/html/2606.09937#A4)provides per\-model accuracy breakdowns complementing Table[3](https://arxiv.org/html/2606.09937#S3.T3)\. Appendix[E](https://arxiv.org/html/2606.09937#A5)lists the per\-model cost constants used in the latency model\. Appendix[F](https://arxiv.org/html/2606.09937#A6)shows qualitative examples of CGEE firing and deferring\. Appendix[G](https://arxiv.org/html/2606.09937#A7)analyses per\-layer entropy dynamics underlying Level\-2 exit\. Appendix[H](https://arxiv.org/html/2606.09937#A8)documents the RSBCM deep\-tree validation experiment\. Appendix[I](https://arxiv.org/html/2606.09937#A9)gives implementation details\. Appendix[J](https://arxiv.org/html/2606.09937#A10)provides a full reproducibility checklist\.
## Appendix ATheoretical Analysis
We characterise the latency savings of RKSC analytically, derive the conditions under which each mechanism is beneficial, and show that the combined speedup decomposes into factors operating on disjoint cost terms\. The main\-text discussion of results in §[3\.3](https://arxiv.org/html/2606.09937#S3.SS3)and the prefix\-length ablation in §[4\.2](https://arxiv.org/html/2606.09937#S4.SS2)refer to Proposition[A\.1](https://arxiv.org/html/2606.09937#A1.Thmhheorem1)and Eq\. \([8](https://arxiv.org/html/2606.09937#A1.E8)\) from this appendix\.
### A\.1Latency Model
Consider generating $B$ independent reasoning branches from a shared problem prefix of $n$ tokens, each producing $t$ decode tokens. We decompose wall-clock latency into four additive terms:

$$\mathcal{L} = \underbrace{\alpha n^{2}}_{\text{prefill}} + \underbrace{B\,\beta n}_{\text{branch overhead}} + \underbrace{B\,t\,\gamma}_{\text{decode}} + \underbrace{\delta}_{\text{verify}}, \tag{6}$$

where $\alpha$ is the per-token-squared self-attention cost (dominant for long prefixes due to the quadratic attention mechanism), $\beta$ is the per-token feed-forward cost, $\gamma$ is the per-step autoregressive decode cost, and $\delta$ is the full-depth verification forward-pass cost. The No-KV baseline pays the full prefill cost for each of the $B$ branches: $\mathcal{L}_{\text{ref}} = B\alpha n^{2} + Bt\gamma + \delta$. This model is additive in the four cost terms, enabling clean decomposition of savings.
### A\.2KV Prefix Sharing Speedup
RKSC computes the prefix KV cache once (cost $\alpha n^{2}$) and broadcasts it to all $B$ branches, so each branch pays only the suffix attention cost $\beta s$ for a short suffix of $s \ll n$ tokens:

$$\mathcal{L}_{\text{KV}} = \alpha n^{2} + B\,\beta s + B\,t\,\gamma + \delta. \tag{7}$$

The KV speedup relative to the No-KV reference is:

$$S_{\text{KV}} = \frac{B\alpha n^{2} + Bt\gamma + \delta}{\alpha n^{2} + B\beta s + Bt\gamma + \delta}. \tag{8}$$
###### Proposition A.1 (KV sharing threshold).
$S_{\text{KV}} > 1$ if and only if $(B-1)\alpha n^{2} > B\beta s$, i.e., the shared prefix exceeds the critical length $n^{*} = \sqrt{\dfrac{B\beta s}{(B-1)\alpha}}$.
###### Proof.
Expanding $S_{\text{KV}} > 1$ gives $\mathcal{L}_{\text{ref}} > \mathcal{L}_{\text{KV}}$. Substituting Eq. ([7](https://arxiv.org/html/2606.09937#A1.E7)) and the No-KV reference, the $Bt\gamma$ and $\delta$ terms cancel, leaving $(B-1)\alpha n^{2} > B\beta s$. Solving for $n$ yields $n > \sqrt{B\beta s / ((B-1)\alpha)} = n^{*}$. ∎
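The latency expressions and the threshold of Proposition A.1 translate directly into a few lines of code. The sketch below mirrors $\mathcal{L}_{\text{ref}}$, Eq. (7), Eq. (8), and $n^{*}$; the function names are illustrative, and any constants passed in are placeholders rather than the fitted values of Appendix E.

```python
import math

def latency_ref(B, n, t, alpha, gamma, delta):
    """No-KV baseline: every branch recomputes the prefix (L_ref)."""
    return B * alpha * n**2 + B * t * gamma + delta

def latency_kv(B, n, s, t, alpha, beta, gamma, delta):
    """Shared-prefix latency, Eq. (7)."""
    return alpha * n**2 + B * beta * s + B * t * gamma + delta

def kv_speedup(B, n, s, t, alpha, beta, gamma, delta):
    """S_KV from Eq. (8)."""
    return latency_ref(B, n, t, alpha, gamma, delta) / latency_kv(
        B, n, s, t, alpha, beta, gamma, delta)

def critical_length(B, s, alpha, beta):
    """n* from Proposition A.1 (requires B > 1)."""
    return math.sqrt(B * beta * s / ((B - 1) * alpha))
```

Evaluating `kv_speedup` just above and just below `critical_length` gives values on either side of 1, which is the content of the proposition.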
Equation ([8](https://arxiv.org/html/2606.09937#A1.E8)) has two important structural properties. First, $S_{\text{KV}}$ depends on $n$, $B$, $s$, $t$, and the cost ratios $\alpha/\beta$ and $\alpha/\gamma$, but *not* on model family, GQA compression ratio, or task. This predicts that fixing $n$ and $B$ produces near-identical KV gains regardless of architecture, a prediction verified empirically in §[3.3](https://arxiv.org/html/2606.09937#S3.SS3), where the vLLM-equivalent gain spans only $1.78\times$–$1.83\times$ (std $0.02$) across five heterogeneous model families. Second, $S_{\text{KV}}$ increases monotonically in $n$ and $B$ but decreases in $t$: longer decode sequences dilute the KV-dominated prefill fraction. On A100 with SDPA, empirical timing on Qwen2.5-7B gives $n^{*} \approx 90$ tokens, placing all experiments at $n=1{,}024$ in the beneficial regime ($n/n^{*} \approx 11$).
#### Worked numerical example\.
The per-token self-attention cost $\alpha$ and per-token feed-forward cost $\beta$ can be fit by microbenchmarking the prefill path at varying $n$ and recovering the quadratic and linear coefficients. Plugging empirical values into Proposition [A.1](https://arxiv.org/html/2606.09937#A1.Thmhheorem1) with $B=8$ and $s=16$ yields $n^{*} \approx 90$ tokens once kernel-launch overhead is folded in. The empirical break-even (§[4.2](https://arxiv.org/html/2606.09937#S4.SS2)) is higher ($n$ between 256 and 512) because fixed SDPA setup costs dominate the linear model at short prefixes, which is precisely why RKSC uses a runtime probe (§[2.2](https://arxiv.org/html/2606.09937#S2.SS2)) rather than the analytical threshold alone. Fitted cost constants per model appear in Appendix [E](https://arxiv.org/html/2606.09937#A5).
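One way to recover the quadratic and linear coefficients from microbenchmark timings is an ordinary least-squares fit of $\alpha n^{2} + \beta n$. The sketch below is an assumption about how such a fit could be done, not the authors' fitting code; it solves the 2×2 normal equations directly and, like the latency model, omits a constant term.

```python
def fit_prefill_costs(lengths, times):
    """Least-squares fit of times ~= alpha * n**2 + beta * n (no intercept)."""
    s4 = sum(n**4 for n in lengths)
    s3 = sum(n**3 for n in lengths)
    s2 = sum(n**2 for n in lengths)
    t2 = sum(n**2 * y for n, y in zip(lengths, times))
    t1 = sum(n * y for n, y in zip(lengths, times))
    det = s4 * s2 - s3 * s3
    if det == 0:
        raise ValueError("need at least two distinct prefix lengths")
    alpha = (t2 * s2 - s3 * t1) / det
    beta = (s4 * t1 - s3 * t2) / det
    return alpha, beta
```

Because the model has no intercept, fixed setup costs in real timings are absorbed into the fitted coefficients, which is consistent with the gap between the analytical and empirical break-even points noted above.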
### A\.3Dual\-Level CGEE Speedup
CGEE saves latency through two independent channels. The verify-skip gate (Level 1) saves the full $\delta$ on a fraction $\rho$ of problems. The layer-level entropy exit (Level 2) saves a fraction $(1 - l^{*}/L)$ of $\delta$ on a fraction $\phi$ of the remaining $(1-\rho)$ problems where entropy stabilises before the final layer. The combined expected latency with both CGEE levels active is:

$$\mathbb{E}[\mathcal{L}_{\text{CGEE}}] = \mathcal{L}_{\text{KV}} - \rho\,\delta - (1-\rho)\,\phi\,\delta\,(1 - l^{*}/L), \tag{9}$$

and the full RKSC speedup relative to the No-KV baseline is:

$$S_{\text{RKSC}} = \frac{\mathcal{L}_{\text{ref}}}{\mathbb{E}[\mathcal{L}_{\text{CGEE}}]} = \frac{B\alpha n^{2} + Bt\gamma + \delta}{\alpha n^{2} + B\beta s + Bt\gamma + \delta - \rho\delta - (1-\rho)\phi\delta(1 - l^{*}/L)}. \tag{10}$$
###### Corollary A.2 (Combined RKSC speedup).
Let $\rho$ denote the fraction of problems on which the verify-skip gate fires, and let the layer-level entropy exit fire at layer $l^{*}$ on a fraction $\phi$ of the remaining $(1-\rho)$ problems. Then $S_{\text{RKSC}}$ is given by Eq. ([10](https://arxiv.org/html/2606.09937#A1.E10)), and $S_{\text{RKSC}} \geq S_{\text{KV}}$ with equality iff $\rho = \phi = 0$.
###### Proof.
Substitute Eq. ([9](https://arxiv.org/html/2606.09937#A1.E9)) into $S_{\text{RKSC}} = \mathcal{L}_{\text{ref}} / \mathbb{E}[\mathcal{L}_{\text{CGEE}}]$ to obtain Eq. ([10](https://arxiv.org/html/2606.09937#A1.E10)). Both subtractive terms in the denominator are non-negative, so $\mathbb{E}[\mathcal{L}_{\text{CGEE}}] \leq \mathcal{L}_{\text{KV}}$, giving $S_{\text{RKSC}} \geq S_{\text{KV}}$. Equality holds iff $\rho\delta + (1-\rho)\phi\delta(1 - l^{*}/L) = 0$, i.e. iff $\rho = \phi = 0$ (since $\delta > 0$ and $l^{*} < L$). ∎
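Eq. (10) and the inequality of Corollary A.2 can be checked numerically. In the sketch below the function name and the argument `exit_frac` (standing for $l^{*}/L$) are illustrative, and the constants used when calling it are placeholders rather than measured values.

```python
def rksc_speedup(B, n, s, t, alpha, beta, gamma, delta, rho, phi, exit_frac):
    """S_RKSC from Eq. (10); exit_frac stands for l*/L in (0, 1)."""
    l_ref = B * alpha * n**2 + B * t * gamma + delta
    l_kv = alpha * n**2 + B * beta * s + B * t * gamma + delta
    # Expected verification cost removed by Level 1 and Level 2, Eq. (9).
    saved = rho * delta + (1.0 - rho) * phi * delta * (1.0 - exit_frac)
    return l_ref / (l_kv - saved)
```

Setting `rho = phi = 0` recovers $S_{\text{KV}}$ exactly, and any positive skip or exit rate can only raise the result, matching the corollary.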
### A\.4Structural Interpretation of the Speedup Decomposition
Equation ([10](https://arxiv.org/html/2606.09937#A1.E10)) decomposes the total latency savings into three mutually disjoint cost terms:
- The KV-sharing term $(B-1)\alpha n^{2} - B\beta s$, which removes $B-1$ of the $B$ prefill costs at the expense of a suffix attention cost. This term is active on *every* solve call when $\sigma_b \geq \tau$.
- The verify-skip term $\rho\delta$, which removes the entire verify pass on a $\rho$-fraction of problems. This term is active only on problems where generation confidence is decisive.
- The layer-exit term $(1-\rho)\phi\delta(1 - l^{*}/L)$, which removes a $(1 - l^{*}/L)$-fraction of the verify pass on the remaining problems. This term is active whenever verify-skip does not fire but layer-level entropy stabilises.
Because the three savings act on disjoint cost terms ($\alpha n^{2}$, $\delta$ in full, and $\delta$ in part), they add in the denominator of the speedup ratio rather than trading off. This is the structural reason that adding CGEE's two exit levels on top of KV sharing produces the observed $1.66\times$ multiplicative improvement over vLLM-equivalent caching.
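The decomposition can be made explicit by subtracting Eq. (9) from the No-KV reference. The following short derivation uses only the symbols defined in Eqs. (6)–(10); the $Bt\gamma$ and $\delta$ terms cancel in the first step exactly as in the proof of Proposition A.1.

```latex
\begin{aligned}
\mathcal{L}_{\text{ref}} - \mathbb{E}[\mathcal{L}_{\text{CGEE}}]
  &= \bigl(\mathcal{L}_{\text{ref}} - \mathcal{L}_{\text{KV}}\bigr)
     + \rho\,\delta + (1-\rho)\,\phi\,\delta\,(1 - l^{*}/L) \\
  &= \underbrace{(B-1)\alpha n^{2} - B\beta s}_{\text{KV sharing}}
     + \underbrace{\rho\,\delta}_{\text{verify skip}}
     + \underbrace{(1-\rho)\,\phi\,\delta\,(1 - l^{*}/L)}_{\text{layer exit}},
\\[4pt]
\frac{S_{\text{RKSC}}}{S_{\text{KV}}}
  &= \frac{\mathcal{L}_{\text{KV}}}{\mathbb{E}[\mathcal{L}_{\text{CGEE}}]}
   = \frac{\mathcal{L}_{\text{KV}}}
          {\mathcal{L}_{\text{KV}} - \rho\,\delta - (1-\rho)\,\phi\,\delta\,(1 - l^{*}/L)}.
\end{aligned}
```

The second line shows that the gain over KV sharing alone depends only on how much of $\delta$ is removed relative to $\mathcal{L}_{\text{KV}}$.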
### A\.5Skip\-Rate Adaptivity
The skip rate $\rho$ is an emergent property of the model–task interaction. On hard problems, $\max_b p^{(b)}$ is low, inter-branch gaps are small, and $\rho$ is correspondingly low. On easier tasks, the model rapidly concentrates confidence on one branch, producing large gaps and high $\rho$. This difficulty-adaptive behaviour means CGEE applies aggressive savings when the model is most confident, and defers to full verification when uncertainty warrants it, matching the qualitative ordering of task difficulty reported in §[4.4](https://arxiv.org/html/2606.09937#S4.SS4).
## Appendix BAlgorithmic Details
This appendix provides detailed pseudocode for the three RKSC mechanisms\. Algorithm[2](https://arxiv.org/html/2606.09937#alg2)details the ASKS similarity gate; Algorithm[3](https://arxiv.org/html/2606.09937#alg3)details the CGEE layer\-level entropy exit; Algorithm[4](https://arxiv.org/html/2606.09937#alg4)details RSBCM eviction\.
Algorithm 2 AsksGate: similarity\-gated KV reuse

Require: root hidden states $\mathbf{H}_{\text{root}}=\{\mathbf{h}^{(l)}_{\text{root}}\}_{l=1}^{L}$ \(pre\-normalised\), branch suffixes $\{\mathbf{s}_b\}_{b=1}^{B}$, threshold $\tau$, exponent $\alpha_w$

1: compute weights $w_l\leftarrow\exp(\alpha_w l/L)/\sum_{l'=1}^{L}\exp(\alpha_w l'/L)$ for $l=1,\ldots,L$
2: $\text{sharedSet}\leftarrow\emptyset$
3: for $b=1,\ldots,B$ do
4: $\{\mathbf{h}^{(l)}_b\}\leftarrow f_\theta^{\text{suffix}}(\mathbf{s}_b)$ \{per\-layer hidden states for this suffix\}
5: $\sigma_b\leftarrow\sum_{l=1}^{L}w_l\cdot\langle\mathbf{h}^{(l)}_b/\|\mathbf{h}^{(l)}_b\|,\,\mathbf{h}^{(l)}_{\text{root}}\rangle$
6: if $\sigma_b\geq\tau$ then
7: $\text{sharedSet}.\text{add}(b)$
8: end if
9: end for
10: return sharedSet
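The gate in Algorithm 2 can be sketched in plain Python as follows. This is a minimal illustration, not the released implementation: it assumes one hidden\-state vector per layer (for example the last\-token state), represents vectors as lists of floats, and takes the per\-layer branch states as an input instead of running the suffix forward pass. The helper names are hypothetical.

```python
import math


def layer_weights(num_layers, alpha_w):
    """Softmax-style weights w_l proportional to exp(alpha_w * l / L), l = 1..L."""
    raw = [math.exp(alpha_w * l / num_layers) for l in range(1, num_layers + 1)]
    total = sum(raw)
    return [r / total for r in raw]


def unit(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def asks_gate(root_states, branch_states, tau, alpha_w):
    """Return the indices of branches whose weighted similarity reaches tau.

    root_states:   list of L vectors, assumed already unit-normalised.
    branch_states: list of B entries, each a list of L vectors.
    """
    weights = layer_weights(len(root_states), alpha_w)
    shared = set()
    for b, states in enumerate(branch_states):
        sigma = sum(
            w * dot(unit(h), r)
            for w, h, r in zip(weights, states, root_states)
        )
        if sigma >= tau:
            shared.add(b)
    return shared


# Toy example with L = 2 layers and 2-dimensional states (hypothetical data).
root = [[1.0, 0.0], [0.0, 1.0]]
branches = [
    [[2.0, 0.0], [0.0, 3.0]],  # same direction as the root at every layer
    [[0.0, 1.0], [1.0, 0.0]],  # orthogonal to the root at every layer
]
print(asks_gate(root, branches, tau=0.9, alpha_w=1.0))
```

With a positive `alpha_w` the weights increase with depth, so later layers contribute more to the similarity score.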
Algorithm 3 VerifyWithEntropyExit: layer\-level entropy\-stabilisation exit

Require: candidate tokens $\{\mathbf{y}_b\}$, unembedding $W_u$, thresholds $\theta$, $\epsilon$, $l_{\min}$

1: install forward hooks on each transformer layer \(via register\_forward\_hook\)
2: $H_{\text{prev}}\leftarrow+\infty$
3: for $l=1,\ldots,L$ do
4: compute layer output $\mathbf{h}^{(l)}$ \(forward pass up to layer $l$\)
5: $\mathbf{z}\leftarrow\mathbf{h}^{(l)}_{\text{last-token}}\cdot W_u^{\top}$ \{unembedding projection\}
6: $\mathbf{p}\leftarrow\mathrm{softmax}(\mathbf{z})$
7: $H^{(l)}\leftarrow-\sum_{v}\mathbf{p}_v\log\mathbf{p}_v$
8: if $l\geq l_{\min}$ and $H^{(l)}<\theta$ and $|H^{(l)}-H_{\text{prev}}|<\epsilon$ then
9: raise \_EarlyExitSignal\($\mathbf{p}$\) \{caught by solver; hooks removed in finally block\}
10: end if
11: $H_{\text{prev}}\leftarrow H^{(l)}$
12: end for
13: remove hooks; return $\mathrm{softmax}(\mathbf{h}^{(L)}W_u^{\top})$ \{full\-depth result if no early exit\}
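The exit test in Algorithm 3 can be isolated from the model as a small pure\-Python sketch. It assumes the per\-layer logits (the unembedded last\-token states) are already available as lists, so it illustrates only the entropy and stabilisation logic, not the hook mechanics. The function names and the toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_exit_layer(per_layer_logits, theta, eps, l_min):
    """Return (exit_layer, probs) with layers indexed from 1.

    Exits at the first layer l >= l_min whose entropy is below theta and
    differs from the previous layer's entropy by less than eps. If no layer
    qualifies, the final layer's distribution is returned.
    """
    h_prev = math.inf
    probs = None
    for l, logits in enumerate(per_layer_logits, start=1):
        probs = softmax(logits)
        h = entropy(probs)
        if l >= l_min and h < theta and abs(h - h_prev) < eps:
            return l, probs
        h_prev = h
    return len(per_layer_logits), probs


# Toy trajectory over a 4-token vocabulary: entropy falls, then plateaus.
toy_logits = [
    [0.0, 0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0, 0.0],
    [5.0, 0.0, 0.0, 0.0],
    [5.1, 0.0, 0.0, 0.0],
    [5.1, 0.0, 0.0, 0.0],
]
layer, dist = entropy_exit_layer(toy_logits, theta=0.5, eps=0.05, l_min=2)
print(layer, [round(p, 3) for p in dist])
```

Initialising the previous entropy to infinity means the stabilisation test can never pass at the first layer, matching step 2 of the algorithm.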
Algorithm 4 RsbcmEvict: attention\-weighted depth\-priority eviction

Require: block table $\mathcal{B}=\{(\text{block}_i,\text{score}_i,\text{depth}_i)\}$, capacity $N_{\max}$

1: while $|\mathcal{B}|>N_{\max}$ do
2: $\omega_i\leftarrow\text{score}_i/(\text{depth}_i+1)$ for all $i$
3: $i^{*}\leftarrow\arg\min_i\omega_i$
4: evict block $i^{*}$ from cache
5: $\mathcal{B}\leftarrow\mathcal{B}\setminus\{(\text{block}_{i^{*}},\text{score}_{i^{*}},\text{depth}_{i^{*}})\}$
6: end while
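A minimal sketch of the eviction loop in Algorithm 4 follows. It assumes the block table is a dictionary mapping a block id to a `(score, depth)` pair and that eviction simply removes the entry; the real cache manager would also free the underlying KV tensors. The table contents are hypothetical.

```python
def rsbcm_evict(blocks, n_max):
    """Evict lowest-importance blocks until at most n_max remain.

    blocks: dict mapping block id -> (attention score, tree depth).
    Importance is omega = score / (depth + 1). The dict is mutated in
    place and the evicted ids are returned in eviction order.
    """
    evicted = []
    while len(blocks) > n_max:
        victim = min(blocks, key=lambda i: blocks[i][0] / (blocks[i][1] + 1))
        del blocks[victim]
        evicted.append(victim)
    return evicted


# Hypothetical table with B = 8 blocks and capacity N_max = 4.
table = {
    0: (0.80, 0),
    1: (0.60, 1),
    2: (0.90, 3),
    3: (0.20, 0),
    4: (0.50, 4),
    5: (0.70, 0),
    6: (0.30, 1),
    7: (0.05, 0),
}
print(rsbcm_evict(table, n_max=4), sorted(table))
```

Under this importance score, two blocks with equal attention score are separated by depth: the deeper block has the smaller value and is evicted first.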
## Appendix C Hyperparameter Settings and Sensitivity
### C\.1 Complete Hyperparameter Table
Table [4](https://arxiv.org/html/2606.09937#A3.T4) collects every hyperparameter used in RKSC, split into fixed\-across\-models settings \(top block\) and per\-model calibrated settings \(bottom block\)\. Calibrated thresholds are obtained on 30 held\-out GPQA Diamond problems disjoint from the evaluation set\.
Table 4: Complete RKSC hyperparameter settings\. Values marked *calibrated* are fit per model on 30 held\-out problems\.

The $\tau_{\text{conf}}$ and $r_{\text{gap}}$ values are shared across all models because they are set to percentiles of each model’s own confidence distribution, making them self\-normalising\. The $\theta$ values differ because Mistral\-7B’s logit distributions are substantially sharper \(lower entropy\) than those of the other models\.
### C\.2 Prefix\-Length Sweep
Table 5: End\-to\-end speedup vs prefix length \(Qwen2\.5\-7B, $N=20$, $B=8$\)\.
### C\.3 ASKS Threshold Sweep
Table [6](https://arxiv.org/html/2606.09937#A3.T6) reports the full $\tau$\-sweep referenced in §[4\.3](https://arxiv.org/html/2606.09937#S4.SS3)\. In the identical\-prefix regime, all four settings produce statistically indistinguishable speedups at $n\approx 512$, confirming that the gate decision is $\tau$\-insensitive in this regime\.
Table 6: ASKS threshold sweep at $n\approx 512$, $N=20$ problems on Qwen2\.5\-7B\. Speedups are indistinguishable at all four settings, confirming $\tau$\-insensitivity in the identical\-prefix regime\.
### C\.4 CGEE Cross\-Dataset Calibration
The entropy threshold $\theta$ is calibrated on GPQA Diamond\. Table [7](https://arxiv.org/html/2606.09937#A3.T7) shows the consequences of using a single GPQA\-calibrated $\theta$ versus per\-dataset calibration\. Per\-dataset $\theta$ calibration via a small held\-out sweep \(5–10 problems\) restores gate activity on out\-of\-distribution datasets without loss in accuracy\.
Table 7: CGEE Level\-2 exit activity: GPQA\-calibrated $\theta$ vs per\-dataset\-calibrated $\theta$ \(Qwen2\.5\-7B\)\. The GPQA\-calibrated $\theta=8.0$ does not activate on out\-of\-distribution datasets with lower entropy profiles; per\-dataset calibration \(5–10 problems\) restores full activity\.

This cross\-domain behaviour is documented as a limitation in §[5](https://arxiv.org/html/2606.09937#S5)\. The main\-paper results use per\-dataset $\theta$ throughout; the GPQA\-calibrated value is the *default* for deployment without a calibration sweep\.
## Appendix D Per\-Model Accuracy Analysis
Table [8](https://arxiv.org/html/2606.09937#A4.T8) decomposes the aggregate accuracy impact \(Table [3](https://arxiv.org/html/2606.09937#S3.T3)\) across the 20 model–dataset pairs\. Six errors concentrate in three pairs; the remaining 17 pairs exhibit zero CGEE\-induced errors relative to full\-depth verification\.

Table 8: Per\-model, per\-dataset CGEE accuracy impact\. Errors: number of problems where CGEE’s answer diverges from full\-depth verify\. $\Delta$Acc: CGEE accuracy minus full\-verify accuracy \($N=50$ per cell\)\.

| Model | Dataset | Errors | $\Delta$Acc | Skip rate |
|---|---|---|---|---|
| Qwen2\.5\-7B | GPQA | 0 | 0\.0% | 3% |
| | GSM8K | 0 | 0\.0% | 32% |
| | ARC\-C | 0 | 0\.0% | 71% |
| | MMLU | 0 | 0\.0% | 70% |
| Mistral\-7B | GPQA | 0 | 0\.0% | 23% |
| | GSM8K | 0 | 0\.0% | 30% |
| | ARC\-C | 3 | −4\.0% | 60% |
| | MMLU | 0 | 0\.0% | 47% |
| Falcon3\-7B | GPQA | 0 | 0\.0% | 54% |
| | GSM8K | 0 | 0\.0% | 50% |
| | ARC\-C | 0 | 0\.0% | 66% |
| | MMLU | 0 | 0\.0% | 39% |
| Falcon3\-10B | GPQA | 0 | 0\.0% | 72% |
| | GSM8K | 0 | 0\.0% | 57% |
| | ARC\-C | 2 | −2\.0% | 77% |
| | MMLU | 0 | 0\.0% | 66% |
| Llama\-3\-8B | GPQA | 0 | 0\.0% | 57% |
| | GSM8K | 1 | −2\.0% | 65% |
| | ARC\-C | 0 | 0\.0% | 74% |
| | MMLU | 0 | 0\.0% | 92% |
| Total | | 6 | −0\.2% \(mean\) | 55% \(mean\) |

### D\.1 Failure Case Analysis
Examining the six CGEE\-induced errors reveals a consistent pattern: in each case, one branch emits the answer letter with high absolute probability while simultaneously carrying a subtle formatting deviation \(e\.g\., a trailing period, different case, or embedded justification\) that full\-depth verification penalises but generation confidence does not\. On ARC\-Challenge, this manifests as the model fluently emitting, say, “C\.” with $p>0.9$ while full\-depth verification prefers the branch that emitted “C” without punctuation because the verifier has learned the benchmark’s exact\-match convention\.
This is not a failure of the entropy signal itself but of the verify\-skip gate’s reliance on generation confidence as a *proxy* for correctness on multiple\-choice tasks\. On free\-form tasks \(GSM8K, GPQA\), where the answer letter is embedded in longer reasoning and generation confidence more directly reflects understanding, the error rate drops to 0–1 per 50 problems\. The dual condition in Eq\. \([4](https://arxiv.org/html/2606.09937#S2.E4)\), requiring both high absolute confidence and a clear relative margin, prevents the naive failure mode of a single low\-confidence branch firing the skip, but does not guard against correlated fluency\-vs\-correctness divergence on format\-sensitive tasks\. Users running on such benchmarks should either raise $\tau_{\text{conf}}$ or disable Level 1 and rely on Level 2 alone\.
## Appendix E Empirical Cost Constants
Table [9](https://arxiv.org/html/2606.09937#A5.T9) reports the KV probe measurements for all five models, obtained by running the self\-calibrating runtime probe \(§[2\.2](https://arxiv.org/html/2606.09937#S2.SS2)\) at $n\approx 1{,}024$ tokens, $B=8$\. The probe times three paths \(batched full recompute, single root forward, and batched suffix decode with shared cache\) and confirms that KV sharing is beneficial for all models at this prefix length\.
Table 9: KV probe measurements on A100\-80GB with SDPA, $B=8$, $n\approx 1{,}024$ tokens\. *batch\_full*: batched prefill with no sharing \(ms\)\. *prefix\_fwd*: single root forward pass \(ms\)\. *sfx\+KV*: batched suffix decode conditioned on shared cache \(ms\)\. *net*: latency saved per solve call \(ms\)\. $B^{*}$: break\-even branching factor; KV sharing is beneficial for all $B>B^{*}$\.

All five models achieve $B^{*}\leq 1.7$, confirming KV sharing is beneficial for any $B\geq 2$\. At the experimental setting of $B=8$, the net saving per solve call ranges from 445 ms \(Falcon3\-7B\) to 627 ms \(Falcon3\-10B\), consistent with the latency model prediction that larger models benefit more \(Eq\. \([8](https://arxiv.org/html/2606.09937#A1.E8)\)\)\. The near\-identical $B^{*}$ values \(1\.6–1\.7\) across architectures spanning three training lineages, two vocabulary sizes, and 7B–10B parameters support the architecture\-invariance claim in §[3\.3](https://arxiv.org/html/2606.09937#S3.SS3): the break\-even condition $(B-1)\alpha n^{2}>B\beta s$ \(Proposition [A\.1](https://arxiv.org/html/2606.09937#A1.Thmhheorem1)\) depends on the cost ratio $\alpha/\beta$, which is approximately constant across GQA model families at fixed hardware\.
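The break\-even condition can be rearranged into a closed form for the break\-even branching factor, which the sketch below evaluates. Writing `P` for the prefill cost (the `alpha * n^2` term) and `Q` for the per\-branch suffix cost (the `beta * s` term), the condition `(B - 1) * P > B * Q` holds exactly when `B > P / (P - Q)`, provided `P > Q`. The function names are hypothetical, and the costs used in the example are made\-up values, not entries from Table 9.

```python
import math


def break_even_branching(prefill_cost, suffix_cost):
    """Smallest real B at which (B - 1) * P equals B * Q, i.e. P / (P - Q).

    Returns infinity when the suffix cost is at least the prefill cost,
    because sharing then never pays off under this cost model.
    """
    if suffix_cost >= prefill_cost:
        return math.inf
    return prefill_cost / (prefill_cost - suffix_cost)


def net_saving(B, prefill_cost, suffix_cost):
    """KV-sharing saving per solve call: (B - 1) * P - B * Q."""
    return (B - 1) * prefill_cost - B * suffix_cost


# Hypothetical costs in milliseconds.
P, Q = 100.0, 40.0
print(break_even_branching(P, Q))
print(net_saving(8, P, Q))
```

In this cost model, a break\-even value below 2 means that sharing already pays off with two branches, which is the reading applied to the reported values of 1\.6–1\.7.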
## Appendix F Qualitative Examples
We present two contrasting qualitative examples to illustrate when CGEE fires and when it defers\.
#### Example 1: CGEE\-skip fires \(ARC\-Challenge, Qwen2\.5\-7B\)\.
> Problem: Which of the following is the primary greenhouse gas released by livestock digestion? \(A\) Carbon dioxide \(B\) Methane \(C\) Water vapour \(D\) Nitrous oxide
All 8 branches converge on “\(B\) Methane” within 3 decode steps\. Top\-1 probabilities: $\max_b p^{(b)}=0.94$; 2nd\-max $=0.22$; gap ratio $=0.77$\. Both conditions of Eq\. \([4](https://arxiv.org/html/2606.09937#S2.E4)\) fire and the verify pass is skipped entirely\. Full\-depth verify, run separately, confirms branch 3 is the correct answer, agreeing with the skipped decision\. Latency: 321 ms \(skip\) vs 512 ms \(full verify\), a $1.59\times$ reduction on this single problem\.
#### Example 2: CGEE\-skip defers \(GPQA Diamond, Qwen2\.5\-7B\)\.
> Problem: A particle of mass $m$ is placed in a 3D isotropic harmonic oscillator potential with frequency $\omega$\. What is the degeneracy of the first excited state?
Branches split: 4 emit “3”, 3 emit “6”, 1 emits “22”\. $\max_b p^{(b)}=0.61$; 2nd\-max $=0.54$; gap ratio $=0.11<r_{\text{gap}}$\. The relative\-gap condition fails, so the gate defers to full verification\. Level 2 \(layer\-level entropy exit\) fires at layer 19 out of 28, saving $\approx 32\%$ of the verify cost\. Full verify selects the “3” branches \(correct: the first excited state has degeneracy 3 for the 3D isotropic oscillator\)\. Latency: 641 ms \(Level\-2 exit\) vs 892 ms \(full verify\), a $1.39\times$ reduction\. This example illustrates the difficulty\-adaptive behaviour quantified in §[4\.4](https://arxiv.org/html/2606.09937#S4.SS4)\.
These examples are illustrative and the specific latencies quoted are from individual runs rather than averages; aggregate figures appear in Table[2](https://arxiv.org/html/2606.09937#S3.T2)\.
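The gate decisions in the two examples can be reproduced with a short sketch. It assumes the gap ratio is the difference between the largest and second\-largest branch confidences divided by the largest, which matches the quoted figures (0\.77 and 0\.11) but is inferred from them rather than taken from Eq\. \(4\). The threshold values `tau_conf=0.9` and `r_gap=0.5` are hypothetical placeholders, as is the use of inclusive comparisons.

```python
def gap_ratio(confidences):
    """(top - second) / top over the branch confidences (assumed definition)."""
    top, second = sorted(confidences, reverse=True)[:2]
    return (top - second) / top


def should_skip_verify(confidences, tau_conf, r_gap):
    """Dual condition: high absolute confidence and a clear relative margin."""
    return max(confidences) >= tau_conf and gap_ratio(confidences) >= r_gap


# Only the two largest confidences from each example are needed here.
example_1 = [0.94, 0.22]
example_2 = [0.61, 0.54]

for name, conf in (("Example 1", example_1), ("Example 2", example_2)):
    decision = should_skip_verify(conf, tau_conf=0.9, r_gap=0.5)
    print(name, round(gap_ratio(conf), 2), "skip" if decision else "defer")
```

Requiring both conditions means a large margin alone is not enough: a branch that leads clearly but with low absolute confidence still goes to full verification.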
## Appendix G Per\-Layer Entropy Dynamics
Table [10](https://arxiv.org/html/2606.09937#A7.T10) shows the typical per\-layer entropy trajectory during a verification forward pass on Qwen2\.5\-7B \(28 layers\)\. The trajectory confirms the premise of Level 2: entropy decays approximately monotonically through the network, with a rapid drop in the middle layers followed by a plateau that begins around layer 17–19 and persists to the final layer\.
Table 10: Typical per\-layer entropy trajectory on Qwen2\.5\-7B verify pass \(means over 30 held\-out GPQA problems\)\. *Layer* denotes the transformer layer index \(1–28\); *Entropy* is $H^{(l)}$ from Eq\. \([5](https://arxiv.org/html/2606.09937#S2.E5)\)\.

The stabilisation condition $|H^{(l)}-H^{(l-1)}|<\epsilon=3.0$ is first satisfied at layer $l^{*}=19$ on average, matching the reported mean exit layer of 18\.4 in the main text\. Between layer 19 and layer 28, entropy changes by less than $0.04$ nats, well below any decision\-changing threshold for the argmax over the vocabulary on this task, confirming that the remaining $34\%$ of layers do not alter the verification outcome and can safely be skipped\. The values in Table [10](https://arxiv.org/html/2606.09937#A7.T10) are representative means from profiling runs on GPQA Diamond hold\-out problems; exact per\-problem curves are available in the released code\.
## Appendix H RSBCM Deep\-Tree Validation
To validate RSBCM in the regime where evictions actually fire, we run a stress\-test configuration with capacity $N_{\max}=4$ and branching $B=8$ on 20 GPQA Diamond problems, forcing exactly $B-N_{\max}=4$ evictions per solve call\. Table [11](https://arxiv.org/html/2606.09937#A8.T11) reports the results\.
Table 11: RSBCM stress test on Qwen2\.5\-7B \($N=20$ problems, $B=8$, $N_{\max}=4$\)\.

The eviction count matches the expected value exactly, confirming that the importance score $\omega=\text{score}/(\text{depth}+1)$ correctly identifies and retires the lowest\-value blocks\. Overhead is statistically indistinguishable from zero, and no answer regressions were observed\. RSBCM is therefore dormant in the default single\-depth experiments reported in the main paper \(where $B<N_{\max}=2000$ always holds\), but available as a safety mechanism for deep tree search\.
## Appendix I Implementation Details
### I\.1 Stepwise Batched Decode
RKSC replaces generate\(\) with a direct model\(\) loop: \(1\) suffix prefill conditioned on $\mathcal{C}_B$; \(2\) token\-by\-token decode with pre\-allocated attention masks sliced per step; \(3\) early termination on EOS or decode\-level CGEE gate \($\theta_{\text{dec}}=0.72$, $r_{\text{dec}}=0.10$\)\. Eliminating GenerationMixin scaffolding removes logit processors, stopping criteria, and beam tracking that are not needed in this pipeline, saving approximately 8–12% of end\-to\-end decode time measured on Qwen2\.5\-7B\.
### I\.2 CGEE Layer Hooks
Hooks are installed via register\_forward\_hook on each transformer layer module\. Each hook extracts the last\-token hidden state, projects it through the cached unembedding matrix $W_u$, computes softmax entropy, and raises an \_EarlyExitSignal exception when exit conditions are met\. The solver catches this exception in a try/finally block that guarantees hook removal even if an exit fires, preventing a memory leak that would otherwise persist across solve calls\. On Qwen2\.5\-7B, the per\-layer hook overhead is $\approx 40\,\mu$s, which is recovered many times over whenever an exit fires before the final layer\.
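The control flow described above (an exception raised inside a hook, caught by the caller, with hook removal guaranteed by a `finally` clause) can be sketched without a deep\-learning framework. `TinyLayer` and `HookHandle` below are hypothetical stand\-ins that only mimic the register\-then\-remove shape of forward hooks; they are not PyTorch classes, and the exit condition is reduced to a fixed layer index for illustration.

```python
class EarlyExitSignal(Exception):
    """Raised inside a hook to abort the remaining layers."""

    def __init__(self, layer_index):
        super().__init__(f"early exit at layer {layer_index}")
        self.layer_index = layer_index


class HookHandle:
    """Stand-in for a hook handle: remove() detaches the hook."""

    def __init__(self, hooks, fn):
        self._hooks, self._fn = hooks, fn

    def remove(self):
        if self._fn in self._hooks:
            self._hooks.remove(self._fn)


class TinyLayer:
    """Stand-in layer that adds 1 and then calls its forward hooks."""

    def __init__(self):
        self.hooks = []

    def register_forward_hook(self, fn):
        self.hooks.append(fn)
        return HookHandle(self.hooks, fn)

    def __call__(self, x):
        out = x + 1
        for fn in list(self.hooks):
            fn(self, x, out)
        return out


def run_with_exit(layers, x, exit_at):
    """Return the layer at which the pass stopped (len(layers) if no exit)."""

    def make_hook(idx):
        def hook(module, inputs, output):
            if idx == exit_at:  # placeholder for the entropy test
                raise EarlyExitSignal(idx)
        return hook

    handles = [
        layer.register_forward_hook(make_hook(idx))
        for idx, layer in enumerate(layers, start=1)
    ]
    try:
        for layer in layers:
            x = layer(x)
        return len(layers)
    except EarlyExitSignal as signal:
        return signal.layer_index
    finally:
        # Runs on both the normal and the early-exit path.
        for handle in handles:
            handle.remove()


stack = [TinyLayer() for _ in range(5)]
print(run_with_exit(stack, 0, exit_at=3), [len(l.hooks) for l in stack])
```

Because the cleanup sits in `finally`, no hook remains attached after either path, so a later call starts from a clean set of layers.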
### I\.3 KV Cache Expansion
The prefix cache is expanded via \.repeat\_interleave\(B, dim=0\) with \.clone\(\) for contiguous memory\. Direct views \(without \.clone\(\)\) exposed an SDPA kernel bug on non\-contiguous inputs in our PyTorch 2\.3\.1 build; the explicit clone resolves this at the cost of one additional tensor allocation per solve call, which is negligible at $n=1024$ but grows with $B\cdot L\cdot d_k$\.
### I\.4 Memory Release
unload\_model\(\) explicitly pops arch\['\_unembed\_weight'\] before deleting the model, preventing a $\sim 1.8$ GB reference leak for 9B models, then runs 3 GC passes and torch\.cuda\.empty\_cache\(\)\. Without this explicit pop, PyTorch’s reference counting retains the unembedding matrix via the cached reference in the CGEE hook infrastructure, causing accumulating leaks across sequential solves\.
### I\.5 TF32 and bfloat16 Configuration
All experiments use bfloat16 model weights with TF32\-enabled matrix multiplication \(torch\.backends\.cuda\.matmul\.allow\_tf32 = True\) and SDPA attention \(scaled\_dot\_product\_attention\)\. TF32 provides approximately $5\%$ additional throughput at negligible precision loss; disabling it produces identical answers but $\approx 5\%$ higher latency across all conditions uniformly \(so the reported *ratios* are TF32\-independent\)\.
## Appendix J Reproducibility Checklist
#### Hardware and software\.
- •Single NVIDIA A100\-80GB SXM4 GPU, PCIe 4\.0\.
- •PyTorch 2\.3\.1, Transformers 4\.44\.0, CUDA 12\.1, SDPA attention backend, bfloat16 \+ TF32\.
- •Ubuntu 22\.04, Python 3\.10\.
#### Determinism\.
- •Seeds: torch\.manual\_seed\(42\), numpy\.random\.seed\(42\), torch\.cuda\.manual\_seed\_all\(42\), random\.seed\(42\) at the start of every condition\.
- •torch\.backends\.cudnn\.deterministic = True and torch\.backends\.cudnn\.benchmark = False for the warmup phase; benchmark re\-enabled for timing measurements to allow kernel autotuning \(identical kernel across all conditions ensures timing fairness\)\.
#### Timing\.
- •$N_{\text{runs}}=2$–$3$ per problem, $N_{\text{warmup}}=2$ per condition, torch\.cuda\.synchronize\(\) before each measurement\.
- •Warmup discarded; reported latency is mean over the $N_{\text{runs}}$ measurements and problems\.
- •Bootstrap 95% confidence intervals via $B_{\text{boot}}=10{,}000$ resamples of the problem\-level latencies\.
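The bootstrap step in the timing protocol can be sketched with the standard library alone. The sketch assumes a percentile bootstrap of the mean, which the checklist does not specify, so that choice is an assumption; the function name and the latency values are hypothetical.

```python
import random
import statistics


def bootstrap_mean_ci(latencies, n_boot=10_000, level=0.95, seed=42):
    """Percentile-bootstrap confidence interval for the mean latency.

    Resamples the problem-level latencies with replacement n_boot times,
    computes the mean of each resample, and returns the empirical
    (alpha/2, 1 - alpha/2) quantiles of those means.
    """
    rng = random.Random(seed)
    n = len(latencies)
    means = sorted(
        statistics.fmean(rng.choices(latencies, k=n)) for _ in range(n_boot)
    )
    alpha = 1.0 - level
    lo_idx = max(0, round(alpha / 2 * n_boot))
    hi_idx = min(n_boot - 1, round((1 - alpha / 2) * n_boot) - 1)
    return means[lo_idx], means[hi_idx]


# Hypothetical per-problem latencies in milliseconds.
sample = [512.0, 498.3, 530.1, 605.7, 471.2, 544.9, 520.4, 489.8, 560.3, 515.6]
low, high = bootstrap_mean_ci(sample)
print(f"mean = {statistics.fmean(sample):.1f} ms, 95% CI = [{low:.1f}, {high:.1f}]")
```

Resampling at the problem level, as the checklist states, treats each problem as one observation, so repeated runs of the same problem would be averaged before this step.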
#### Data\.
- •Datasets: GPQA Diamond \(rein2023\), GSM8K \(cobbe2021\), ARC\-Challenge \(allenai:arc\), MMLU\-STEM \(hendryckstest2021\) via HuggingFace Datasets\.
- •Evaluation sets: first $N=50$ problems per dataset \(main eval\), first $N=30$ for throughput decomposition, $N=15$ for verify isolation, $N=20$ for prefix\-length sweep\. Fixed indices used across conditions to ensure paired comparisons\.
- •Held\-out calibration set: 30 GPQA Diamond problems disjoint from evaluation\.
- •Padding: neutral filler string prepended iteratively to $\approx 1{,}024$ tokens; hard cap at $\text{max\_seq\_len}-256$ tokens to reserve decode headroom\.
#### Models\.
- •All models loaded with AutoModelForCausalLM\.from\_pretrained\(…, torch\_dtype=torch\.bfloat16, attn\_implementation="sdpa"\)\.
- •Checkpoints: Qwen2\.5\-7B\-Instruct, Mistral\-7B\-Instruct\-v0\.3, Falcon3\-7B\-Instruct, Falcon3\-10B\-Instruct, Meta\-Llama\-3\-8B\-Instruct \(HuggingFace hub\)\.
- •Chat templates: each model’s native apply\_chat\_template used verbatim\.
#### Branching\.
- •$B=8$ branches throughout; 8 fixed reasoning\-hint suffixes cycled across branches \(suffixes listed verbatim in released code\)\.
#### Code and artefacts\.
- •Code: [https://github\.com/AnirudhSekar/RKSC](https://github.com/AnirudhSekar/RKSC)\.
- •Released artefacts include: the RKSC solver, calibration scripts, all timing harnesses, bootstrap analysis scripts, and per\-problem latency logs\.
#### Expected compute\.
Reproducing the full set of main\-paper results requires approximately 42 A100\-GPU\-hours \(itemised in §[3\.1](https://arxiv.org/html/2606.09937#S3.SS1)\)\.