River-LLM: Large Language Model Seamless Exit Based on KV Share
Summary
River-LLM proposes a training-free early-exit framework for decoder-only LLMs that uses KV-sharing to eliminate KV-cache gaps, achieving 1.71–2.16× speedup without quality loss.
Source: https://huggingface.co/papers/2604.18396
Abstract
River-LLM enables efficient token-level early exit in decoder-only LLMs through KV-sharing mechanisms that preserve historical states without latency overhead.
Large Language Models (LLMs) have demonstrated exceptional performance across diverse domains but are increasingly constrained by high inference latency. Early Exit has emerged as a promising solution to accelerate inference by dynamically bypassing redundant layers. However, in decoder-only architectures, the efficiency of Early Exit is severely bottlenecked by the KV Cache Absence problem, where skipped layers fail to provide the necessary historical states for subsequent tokens. Existing solutions, such as recomputation or masking, either introduce significant latency overhead or incur severe precision loss, failing to bridge the gap between theoretical layer reduction and practical wall-clock speedup. In this paper, we propose River-LLM, a training-free framework that enables seamless token-level Early Exit. River-LLM introduces a lightweight KV-Shared Exit River that allows the backbone’s missing KV cache to be naturally generated and preserved during the exit process, eliminating the need for costly recovery operations. Furthermore, we utilize state transition similarity within decoder blocks to predict cumulative KV errors and guide precise exit decisions. Extensive experiments on mathematical reasoning and code generation tasks demonstrate that River-LLM achieves 1.71× to 2.16× practical speedup while maintaining high generation quality.
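The abstract describes the mechanism only at a high level. As a rough illustration of how a token-level exit can avoid leaving a KV-cache gap, here is a minimal NumPy sketch; the cosine-similarity exit rule, the projection-based KV back-fill, and all weights are assumptions made for illustration, not the paper's actual method.

```python
# Minimal, self-contained sketch of token-level early exit with KV back-fill.
# NOT the authors' implementation: the exit rule (cosine similarity between
# consecutive hidden states) and the KV-share rule (projecting the exit-layer
# state with each skipped layer's own K/V weights) are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
D, N_LAYERS, EXIT_THRESHOLD = 64, 8, 0.99

# Toy per-layer weights standing in for decoder blocks; near-identity so the
# toy exit criterion actually fires in this demo.
W_hidden = [np.eye(D) + 0.05 * rng.standard_normal((D, D)) / np.sqrt(D)
            for _ in range(N_LAYERS)]
W_k = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(N_LAYERS)]
W_v = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(N_LAYERS)]

# kv_cache[layer] holds one (key, value) pair per generated token.
kv_cache = [[] for _ in range(N_LAYERS)]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def decode_token(x):
    """Run one token through the stack, exit early when the hidden state stops
    changing, and back-fill the skipped layers' KV so no cache gap is left."""
    h = x
    for layer in range(N_LAYERS):
        h_next = W_hidden[layer] @ h                 # stand-in for a full block
        kv_cache[layer].append((W_k[layer] @ h, W_v[layer] @ h))
        # "State transition similarity": if the block barely changed the state,
        # treat the remaining layers as redundant and exit.
        if layer < N_LAYERS - 1 and cosine(h, h_next) > EXIT_THRESHOLD:
            # Cheap KV share for every skipped layer: project the exit-layer
            # state with that layer's K/V weights instead of running the block,
            # so later tokens never hit the KV Cache Absence problem.
            for skipped in range(layer + 1, N_LAYERS):
                kv_cache[skipped].append((W_k[skipped] @ h_next, W_v[skipped] @ h_next))
            return h_next, layer
        h = h_next
    return h, N_LAYERS - 1

h_out, exit_layer = decode_token(rng.standard_normal(D))
print(f"exited at layer {exit_layer}; per-layer cache lengths: {[len(c) for c in kv_cache]}")
```

The point of the sketch is the back-fill loop: because every layer's cache receives an entry for every token, whether or not the layer actually ran, subsequent tokens attend over complete caches and no recomputation or masking is needed.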
Get this paper in your agent:
hf papers read 2604.18396
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction
R²-dLLM introduces spatio-temporal redundancy reduction techniques that cut diffusion LLM decoding steps by up to 75% while preserving generation quality, addressing a key deployment bottleneck.
@pallavishekhar_: KV Cache in LLMs Read here: https://outcomeschool.com/blog/kv-cache-in-llms…
This article explains the concept of KV Cache in Large Language Models, detailing how it optimizes text generation by storing and reusing key-value pairs to avoid redundant computations during inference; a minimal code sketch of this caching pattern follows the similar-articles list below.
Two-dimensional early exit optimisation of LLM inference
Authors propose a 2D early-exit method that jointly trims layers and input sentences, yielding 1.4–2.3× extra speed-up on sentiment tasks across Llama 3.1/3.2, Gemma and Qwen models.
KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs
KV Packet proposes a recomputation-free cache reuse framework for LLMs that uses trainable soft-token adapters to bridge context discontinuities, eliminating overhead while maintaining performance comparable to full recomputation baselines on Llama-3.1 and Qwen2.5.
JumpLoRA: Sparse Adapters for Continual Learning in Large Language Models
JumpLoRA introduces a novel sparse adapter framework for continual learning in LLMs using JumpReLU gating to dynamically isolate task parameters and prevent catastrophic forgetting. The method enhances LoRA-based approaches and outperforms state-of-the-art continual learning methods like ELLA.
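For readers new to the mechanism summarized in the "KV Cache in LLMs" entry above, here is a minimal, hypothetical single-head sketch of the reuse pattern it describes. The weights and shapes are toy placeholders, not any real model's configuration.

```python
# Minimal sketch of KV caching: project each token's key/value once, store it,
# and at every decode step project only the newest token instead of re-encoding
# the whole prefix. Toy single-head attention, for illustration only.
import numpy as np

rng = np.random.default_rng(0)
D = 32
W_q, W_k, W_v = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

cached_keys, cached_values = [], []   # grow by one entry per generated token

def attend(x_new):
    """One decode step: project only the new token, reuse all cached K/V."""
    q = W_q @ x_new
    cached_keys.append(W_k @ x_new)
    cached_values.append(W_v @ x_new)
    K = np.stack(cached_keys)          # (t, D): keys of all tokens so far
    V = np.stack(cached_values)        # (t, D): values of all tokens so far
    weights = softmax(K @ q / np.sqrt(D))
    return weights @ V                 # attention output for the new token

for step in range(5):
    out = attend(rng.standard_normal(D))
print(f"cache now holds {len(cached_keys)} key/value pairs")
```

Early-exit methods such as River-LLM interact with exactly this structure: if a layer is skipped for some token, its cache never receives that token's entry, which is the KV Cache Absence problem the paper targets.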