RKSC: Reasoning-Aware KV Cache Sharing and Confident Early Exit for Multi-Step LLM Reasoning
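Summary
Introduces RKSC, a training-free inference framework for multi-branch LLM reasoning that reduces KV cache redundancy through similarity-based sharing and early exit, achieving roughly 3x average speedup (3.99x peak) with a very low error rate.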
arXiv:2606.09937v1 Announce Type: new
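Abstract: We present RKSC (Reasoning-Aware KV Cache Sharing), a training-free inference framework that eliminates two structural redundancies in multi-branch LLM reasoning pipelines. ASKS (Attention-Similarity KV Sharing) computes the prefix KV cache once and broadcasts it to all semantically similar branches, identified by hidden-state cosine similarity; this strictly generalizes the token-exact prefix caching used by vLLM and SGLang. CGEE (Confidence-Gated Early Exit) applies two complementary exit mechanisms: (1) when generation confidence is decisive across branches, it skips the verification forward pass entirely; (2) when per-layer entropy stabilizes, it terminates the verification pass at an intermediate layer, using lightweight hooks on the Transformer backbone. RSBCM (Reasoning-Selective Block Cache Manager) prevents unbounded cache growth through attention-weighted, depth-prioritized eviction. Across five model families (7B-10B), four benchmarks, and 1,000 evaluation problems, RKSC achieves an average speedup of 3.008x (peak 3.990x) over a no-KV baseline and an average improvement of 1.66x over vLLM-equivalent prefix caching, while CGEE introduces an error rate of only 0.37% (6 errors in 1,616 verification calls). No fine-tuning or architectural changes are required. Code is available at https://github.com/AnirudhSekar/RKSC.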
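The two gating ideas in the abstract can be illustrated with a small sketch. This is not the RKSC implementation: the function names, the thresholds (`0.9`, `0.8`), and the use of answer agreement as a stand-in for "decisive generation confidence" are all assumptions made for illustration. The first function marks which branches are similar enough to a prefix's hidden state to reuse its cached KV; the second decides whether a verification pass could be skipped.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def branches_sharing_prefix(prefix_state, branch_states, threshold=0.9):
    """Indices of branches whose hidden state is close enough to the prefix's
    hidden state to reuse its cached KV. The threshold is a hypothetical value."""
    return [
        i
        for i, state in enumerate(branch_states)
        if cosine_similarity(prefix_state, state) >= threshold
    ]


def can_skip_verification(branch_answers, min_agreement=0.8):
    """True when the most common answer's share of branches reaches min_agreement.
    Agreement share is an assumed proxy for confidence, not the paper's metric."""
    if not branch_answers:
        return False
    top_count = Counter(branch_answers).most_common(1)[0][1]
    return top_count / len(branch_answers) >= min_agreement


# Toy hidden states: branches 0 and 2 point almost the same way as the prefix.
prefix = [1.0, 0.0]
branches = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.05]]
shared = branches_sharing_prefix(prefix, branches)  # [0, 2]
skip = can_skip_verification(["42", "42", "42", "42"])  # True
```

With a threshold of exactly 1.0 the similarity gate only admits branches whose hidden state points in the same direction as the prefix's, which is the sense in which similarity-based sharing is a looser condition than exact-match reuse.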
Cached: 2026/06/10 06:19
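Similar articles
TTKV: Temporal-Tiered KV Cache for Long-Context LLM Inference
Drawing on human memory mechanisms, TTKV proposes a temporally tiered KV cache that cuts latency by 76% and doubles throughput in 128K-context LLM inference, while reducing cross-tier traffic by 5.94x.
Reframing the KV Cache Eviction Problem for Long-Context LLM Inference
This paper introduces LaProx, a new KV cache eviction strategy for long-context LLM inference. It recasts eviction as an output-aware matrix multiplication approximation problem and achieves high performance using only 5% of the cache.
River-LLM: Seamless Early Exit for LLMs via KV Sharing
River-LLM proposes a training-free early-exit framework for decoder-only LLMs that uses KV sharing to eliminate the KV-cache gap, achieving 1.71–2.16x inference speedup without quality loss.
CONF-KV: Confidence-Aware KV Cache Eviction and Mixed-Precision Storage for Long-Horizon Large Language Models
CONF-KV is a KV cache management system that uses model uncertainty to dynamically adjust its cache retention policy, improving memory efficiency for long-context LLM inference while keeping perplexity within 1.5-2.1 points.
The KV Cache Is Becoming the Memory Hierarchy of Inference
The article discusses how the KV cache is evolving into a memory hierarchy for LLM inference, optimizing memory management during decoding.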