RKSC: Reasoning-Aware KV Cache Sharing and Confident Early Exit for Multi-Step LLM Inference

arXiv cs.LG paper

Summary

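Introduces RKSC, a training-free inference framework for multi-branch LLM reasoning that reduces KV cache redundancy through similarity-based sharing and early exit, achieving an average speedup of about 3× (peak 3.99×) with an error rate of 0.37%.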

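arXiv:2606.09937v1 Announce Type: new

Abstract: We present RKSC (Reasoning-aware KV Cache Sharing), a training-free inference framework that eliminates two structural redundancies in multi-branch LLM reasoning pipelines. ASKS (Attention-Similarity KV Sharing) computes the prefix KV cache once and broadcasts it to all semantically similar branches, selected by hidden-state cosine similarity; this strictly generalizes the token-exact prefix caching used by vLLM and SGLang. CGEE (Confidence-Gated Early Exit) applies two complementary exit mechanisms: (1) when generation confidence is decisive across branches, it skips the verification forward pass entirely; (2) when per-layer entropy stabilizes, it terminates the verification pass at an intermediate layer, using lightweight hooks on the Transformer backbone. RSBCM (Reasoning-Selective Block Cache Manager) prevents unbounded cache growth through attention-weighted, depth-priority eviction. Across five model families (7B-10B), four benchmarks, and 1000 evaluation problems, RKSC achieves a mean speedup of 3.008× (peak 3.990×) over a no-KV baseline and a mean improvement of 1.66× over vLLM-equivalent prefix caching, while CGEE introduces an error rate of only 0.37% (6 errors in 1616 verification calls). No fine-tuning or architectural changes are required. Code is available at https://github.com/AnirudhSekar/RKSC.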
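The abstract names the decision rules but not their exact form, so the following is only a minimal sketch of how the ASKS sharing test and the two CGEE exit checks might look. It is not taken from the RKSC repository. All function names, the thresholds (`threshold`, `min_confidence`, `tolerance`, `patience`), and the reading of "decisive across branches" as "every branch exceeds a confidence floor" are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def branches_sharing_prefix(reference_state, branch_states, threshold=0.9):
    """ASKS-style test: indices of branches whose hidden state is similar enough
    to the reference to reuse its prefix KV cache. The threshold is hypothetical."""
    return [
        i
        for i, state in enumerate(branch_states)
        if cosine_similarity(reference_state, state) >= threshold
    ]


def can_skip_verification(branch_confidences, min_confidence=0.95):
    """CGEE exit (1), assumed form: skip the verification pass only when
    every branch reports confidence at or above a hypothetical floor."""
    return bool(branch_confidences) and min(branch_confidences) >= min_confidence


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def first_stable_layer(per_layer_probs, tolerance=0.05, patience=2):
    """CGEE exit (2), assumed form: return the first layer index at which the
    entropy has changed by less than `tolerance` for `patience` consecutive
    layers, or None if it never stabilizes."""
    stable_steps = 0
    previous = None
    for layer, probs in enumerate(per_layer_probs):
        current = entropy(probs)
        if previous is not None and abs(current - previous) < tolerance:
            stable_steps += 1
            if stable_steps >= patience:
                return layer
        else:
            stable_steps = 0
        previous = current
    return None
```

In a real pipeline the per-layer distributions would presumably come from forward hooks on the model, and the shared cache would be a tensor structure rather than a list of indices; both are left out here to keep the sketch self-contained.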

Cached: 2026/06/10 06:19

