River-LLM: Large Language Model Seamless Exit Based on KV Share

Hugging Face Daily Papers

Summary

River-LLM proposes a training-free early-exit framework for decoder-only LLMs that uses KV-sharing to eliminate KV-cache gaps, achieving 1.71–2.16× speedup without quality loss.

Large Language Models (LLMs) have demonstrated exceptional performance across diverse domains but are increasingly constrained by high inference latency. Early Exit has emerged as a promising solution to accelerate inference by dynamically bypassing redundant layers. However, in decoder-only architectures, the efficiency of Early Exit is severely bottlenecked by the KV Cache Absence problem, where skipped layers fail to provide the necessary historical states for subsequent tokens. Existing solutions, such as recomputation or masking, either introduce significant latency overhead or incur severe precision loss, failing to bridge the gap between theoretical layer reduction and practical wall-clock speedup. In this paper, we propose River-LLM, a training-free framework that enables seamless token-level Early Exit. River-LLM introduces a lightweight KV-Shared Exit River that allows the backbone's missing KV cache to be naturally generated and preserved during the exit process, eliminating the need for costly recovery operations. Furthermore, we utilize state transition similarity within decoder blocks to predict cumulative KV errors and guide precise exit decisions. Extensive experiments on mathematical reasoning and code generation tasks demonstrate that River-LLM achieves a practical speedup of 1.71–2.16× while maintaining high generation quality.
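
To make the mechanism concrete, here is a minimal PyTorch-flavored sketch of the general idea: a token exits early when consecutive hidden states barely change, and a single lightweight shared projection (the "exit river") fills the KV-cache slots of all the skipped layers so that later tokens find no cache gaps. This is a sketch under assumptions, not the paper's implementation; the names (SharedExitRiver, exit_threshold) and the cosine-similarity heuristic are illustrative, and attention over the cache is omitted for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDecoderLayer(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.k_proj = nn.Linear(d, d)
        self.v_proj = nn.Linear(d, d)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, h, cache):
        # Store this layer's K/V for the current token (attention over the
        # cache is omitted to keep the sketch short).
        cache.append((self.k_proj(h), self.v_proj(h)))
        return h + self.ffn(h)

class SharedExitRiver(nn.Module):
    # One lightweight K/V head shared by every layer a token skips after
    # exiting early; this is our assumption of what a "KV-shared exit
    # river" could look like, not the paper's architecture.
    def __init__(self, d):
        super().__init__()
        self.k_proj = nn.Linear(d, d)
        self.v_proj = nn.Linear(d, d)

def decode_token(h, layers, river, caches, exit_threshold=0.999):
    for i, layer in enumerate(layers):
        h_prev = h
        h = layer(h, caches[i])
        # Exit heuristic (assumed): if consecutive hidden states barely
        # change, treat the remaining layers as redundant for this token.
        sim = F.cosine_similarity(h_prev, h, dim=-1).mean()
        if sim > exit_threshold and i < len(layers) - 1:
            # Fill the skipped layers' cache slots from the shared river,
            # so subsequent tokens see a complete KV history.
            k, v = river.k_proj(h), river.v_proj(h)
            for j in range(i + 1, len(layers)):
                caches[j].append((k, v))
            break
    return h

d, n_layers = 64, 6
layers = nn.ModuleList(ToyDecoderLayer(d) for _ in range(n_layers))
river = SharedExitRiver(d)
caches = [[] for _ in range(n_layers)]
h = torch.randn(1, d)
for _ in range(4):  # a few decode steps
    h = decode_token(h, layers, river, caches)
print([len(c) for c in caches])  # every layer holds one KV entry per token

Note that each decode step leaves every layer's cache with exactly one new entry, whether the layer was executed or skipped, which is what removes the need for the recomputation or masking strategies criticized in the abstract.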

Source: https://huggingface.co/papers/2604.18396

Get this paper in your agent:

hf papers read 2604.18396

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs

Hugging Face Daily Papers

KV Packet proposes a recomputation-free cache reuse framework for LLMs that uses trainable soft-token adapters to bridge context discontinuities, eliminating overhead while maintaining performance comparable to full recomputation baselines on Llama-3.1 and Qwen2.5.

JumpLoRA: Sparse Adapters for Continual Learning in Large Language Models

arXiv cs.CL

JumpLoRA introduces a novel sparse adapter framework for continual learning in LLMs using JumpReLU gating to dynamically isolate task parameters and prevent catastrophic forgetting. The method enhances LoRA-based approaches and outperforms state-of-the-art continual learning methods like ELLA.
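
The JumpReLU gate named in the entry above is a standard hard-threshold activation: JumpReLU(x) = x if x > kappa, else 0. Below is a minimal sketch of just that primitive; how JumpLoRA wires it into LoRA adapters is not described here, so that wiring is left out, and the threshold value is an illustrative assumption.

import torch
import torch.nn as nn

class JumpReLU(nn.Module):
    def __init__(self, kappa: float = 0.1):
        super().__init__()
        # Threshold below which activations are zeroed. (In practice a
        # straight-through estimator is typically used to train kappa,
        # since the hard gate has zero gradient with respect to it.)
        self.kappa = nn.Parameter(torch.tensor(kappa))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Hard gate: identity above the threshold, exactly zero below it,
        # which is what makes the gated activations sparse.
        return x * (x > self.kappa).to(x.dtype)

gate = JumpReLU(kappa=0.5)
print(gate(torch.tensor([0.1, 0.4, 0.9])))  # tensor([0.0000, 0.0000, 0.9000])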