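@FakeMaidenMaker: Impressive! This open-source project can substantially speed up inference for self-hosted LLMs while also saving GPU memory. It has racked up 9.2K GitHub stars, has joined the PyTorch Foundation, and NVIDIA's Dynamo integrates it. GitHub: https://github.com/LMC…
Summary
LMCache is a KV cache management layer that speeds up LLM inference and lowers GPU memory consumption by caching and reusing KV cache. It has 9.2K stars, has joined the PyTorch Foundation, and is integrated by NVIDIA Dynamo.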
Cached at: 2026/06/18 20:20
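Impressive! This open-source project can substantially speed up inference for self-hosted LLMs while also saving GPU memory.
It has racked up 9.2K stars on GitHub, has joined the PyTorch Foundation, and NVIDIA's Dynamo integrates it.
GitHub: https://github.com/LMCache/LMCache
Anyone who serves LLMs with vLLM runs into one kind of waste:
in multi-turn chat, RAG, and long-document scenarios, the model recomputes the KV cache for the same long preceding prompt on every request, which is slow and burns GPU time.
What LMCache does is store the computed KV cache and reuse it:
entries are pushed down from GPU memory to CPU memory, local disk, and even Redis and S3, so the next time the same context shows up it is fetched directly instead of being recomputed.
The result is a noticeably faster time to first token (TTFT) and higher overall throughput; long-context agent applications benefit the most.
It is also not tied to any one engine: switch inference frameworks or storage backends, and the stored cache stays usable.
Deployment is simple too: run pip install lmcache and it can be plugged in.
With LMCache, the same GPUs can handle considerably more requests.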
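The tiering idea in the post (GPU memory first, then CPU memory, disk, and a remote store) can be illustrated with a toy model. This is only a sketch under stated assumptions, not LMCache's implementation: the `Tier` and `TieredKVCache` classes, the per-tier entry capacities, the LRU demotion policy, and hashing the whole context string as the key are all hypothetical choices made for illustration.

```python
from __future__ import annotations

import hashlib
from collections import OrderedDict


class Tier:
    """One storage level with a fixed entry capacity and LRU eviction (hypothetical)."""

    def __init__(self, name: str, capacity: int) -> None:
        self.name = name
        self.capacity = capacity
        self.entries: OrderedDict[str, bytes] = OrderedDict()

    def get(self, key: str) -> bytes | None:
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key]
        return None

    def put(self, key: str, value: bytes) -> tuple[str, bytes] | None:
        """Store an entry and return the evicted (key, value) pair, if any."""
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            return self.entries.popitem(last=False)  # least recently used
        return None


class TieredKVCache:
    """Toy cache: new entries land in the fastest tier, evictions move down a tier."""

    def __init__(self, tiers: list[Tier]) -> None:
        self.tiers = tiers  # ordered fastest first

    @staticmethod
    def key_for(context: str) -> str:
        return hashlib.sha256(context.encode("utf-8")).hexdigest()

    def _insert(self, level: int, key: str, kv: bytes) -> None:
        if level >= len(self.tiers):
            return  # evicted from the slowest tier: the entry is dropped
        evicted = self.tiers[level].put(key, kv)
        if evicted is not None:
            self._insert(level + 1, *evicted)

    def store(self, context: str, kv: bytes) -> None:
        self._insert(0, self.key_for(context), kv)

    def lookup(self, context: str) -> tuple[bytes, str] | None:
        """Return (kv, name of the tier that served it), or None on a miss."""
        key = self.key_for(context)
        for level, tier in enumerate(self.tiers):
            kv = tier.get(key)
            if kv is not None:
                if level > 0:
                    del tier.entries[key]
                    self._insert(0, key, kv)  # promote back to the fastest tier
                return kv, tier.name
        return None
```

With `TieredKVCache([Tier("gpu", 1), Tier("cpu", 1), Tier("disk", 2)])`, storing three contexts in a row pushes the first one down to the disk tier; looking it up serves it from disk and promotes it, so a second lookup is served from the GPU tier. A miss returns `None`, which is the case where the engine would have to run prefill again.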
LMCache/LMCache
Source: https://github.com/LMCache/LMCache
A KV Cache Management Layer for Scalable LLM Inference
Updates
- [2026/05] 🔥 Agentic workload benchmark on AMD MI300X (blog).
- [2026/04] 🔥 Release of LMCache’s new multiprocess (MP) architecture (blog).
- [2026/03] LMCache at GTC 2026 (post).
- [2026/01] LMCache multi-node P2P CPU memory sharing, from experimental feature to production (blog).
- [2025/11] LMCache x CoreWeave accelerate efficient LLM inference for Cohere (blog).
- [2025/10] LMCache joins the PyTorch Foundation and Tensormesh unveiled (blog, PyTorch).
- [2025/09] NVIDIA Dynamo integrates LMCache, accelerating LLM inference (blog).
- [2025/08] 🎉 LMCache hits 5,000+ GitHub stars (blog).
- [2025/08] LMCache supports gpt-oss (20B/120B) on day 1 (blog).
- [2025/07] Get faster LLM inference and cheaper responses with LMCache and Redis (Redis blog).
- [2025/07] LMCache extends its turbo-boost to multimodal models in vLLM V1 (blog).
- [2025/06] LLM Production Stack goes cross-hardware: AMD, Arm and Ascend (blog).
About
LMCache is a KV cache management layer for LLM inference. It turns KV cache from a temporary state into reusable AI-native knowledge that can be stored persistently, reused across multiple serving engines, monitored with an observability stack, and transformed for better generation quality. As a result, LMCache reduces TTFT (time-to-first-token) and improves throughput, especially for long-context agentic, multi-turn conversation, and knowledge-augmented workloads (e.g., RAG).
LMCache is vendor-neutral. It can be used as a KV cache layer for a range of mainstream open-source serving engines, inference frameworks, hardware vendors, storage systems, and infrastructure providers. The vendor neutrality allows users to freely switch between serving engines and storage vendors, while reusing the stored KV caches.
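To make "reuse across requests" concrete, here is a minimal sketch of chunked prefix matching, one common way to key cached KV entries. It is an assumption for illustration, not a description of LMCache internals: `chunk_keys`, `reusable_prefix_tokens`, and the tiny chunk size of 4 tokens are hypothetical. Each key hashes everything up to and including its chunk, so a key only matches when the entire preceding prefix is identical.

```python
from __future__ import annotations

import hashlib

CHUNK_TOKENS = 4  # hypothetical chunk size, kept tiny for readability


def chunk_keys(token_ids: list[int], chunk_tokens: int = CHUNK_TOKENS) -> list[str]:
    """Return one key per full chunk; each key depends on the whole prefix so far."""
    keys: list[str] = []
    running = hashlib.sha256()
    full = len(token_ids) - len(token_ids) % chunk_tokens  # ignore the partial tail
    for start in range(0, full, chunk_tokens):
        chunk = token_ids[start:start + chunk_tokens]
        running.update(",".join(map(str, chunk)).encode("ascii") + b";")
        keys.append(running.hexdigest())  # digest of all chunks seen so far
    return keys


def reusable_prefix_tokens(
    token_ids: list[int],
    cached_keys: set[str],
    chunk_tokens: int = CHUNK_TOKENS,
) -> int:
    """Count leading tokens whose KV could be loaded instead of recomputed."""
    hit_chunks = 0
    for key in chunk_keys(token_ids, chunk_tokens):
        if key not in cached_keys:
            break  # prefix reuse stops at the first miss
        hit_chunks += 1
    return hit_chunks * chunk_tokens
```

If a first request with tokens 1 through 10 has been cached, a second request that shares its first 8 tokens can reuse 8 tokens of KV and only needs prefill for the remainder, while a request that differs in its very first token reuses nothing. The fewer tokens left to prefill, the lower the TTFT.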
Key features
- Engine-independent deployment: LMCache, as a standalone daemon process, manages KV cache independently from the inference engine process, so that KV cache will not be lost even if the inference engine crashes (i.e., no fate-sharing with engines).
- Persistent, tiered KV cache offloading and reuse: Move KV caches out of GPU memory into a tiered storage hierarchy spanning CPU memory, local storage, and remote backends, enabling reuse across requests, sessions, and engine instances to reduce repeated prefill computation and improve TTFT.
- Production-level KV cache observability: LMCache provides a rich set of KV cache observability metrics, including typical Kubernetes metrics (health monitoring, performance diagnostics), KV-cache-specific metrics (request-level and token-level prefix cache hits, lifecycle, request-level KV cache performance), management metrics (user-specific usage), and more.
- Pluggable storage and transport backends: Easily integrate remote storage and KV transfer backends through a unified interface, enabling KV cache offloading and sharing across storage providers. Through this interface, LMCache supports storage backends including CPU RAM, local disk (SSD), Redis/Valkey, Mooncake, InfiniStore, S3-compatible object storage, NIXL, and GDS.
- Non-prefix KV reuse: Extend KV reuse beyond prefix caching by reusing cached KV blocks at any position in the prompt. This leverages CacheBlend to selectively recompute tokens for quality recovery.
- PD disaggregation and KV transfer: Support KV cache transfer from prefill workers to decode workers over NVLink, RDMA, or TCP through transport layers such as NIXL.
- Pluggable KV transformation: A simple interface for researchers to write compression, token dropping, and custom serialization through a flexible SERDE interface.
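As a rough illustration of what a pluggable serializer/deserializer (SERDE) interface can look like, the sketch below defines a small protocol and two implementations that compose. Everything here is hypothetical: `KVSerde`, `RawFloat32Serde`, and `ZlibSerde` are names invented for this example, a flat list of floats stands in for a KV tensor, and LMCache's actual SERDE interface may be shaped differently.

```python
from __future__ import annotations

import struct
import zlib
from typing import Protocol


class KVSerde(Protocol):
    """Hypothetical interface: turn KV values into bytes and back."""

    def serialize(self, values: list[float]) -> bytes: ...

    def deserialize(self, blob: bytes) -> list[float]: ...


class RawFloat32Serde:
    """Packs values as little-endian float32 with no compression."""

    def serialize(self, values: list[float]) -> bytes:
        return struct.pack(f"<{len(values)}f", *values)

    def deserialize(self, blob: bytes) -> list[float]:
        return list(struct.unpack(f"<{len(blob) // 4}f", blob))


class ZlibSerde:
    """Wraps another serde and adds lossless zlib compression on top."""

    def __init__(self, inner: KVSerde, level: int = 6) -> None:
        self.inner = inner
        self.level = level

    def serialize(self, values: list[float]) -> bytes:
        return zlib.compress(self.inner.serialize(values), self.level)

    def deserialize(self, blob: bytes) -> list[float]:
        return self.inner.deserialize(zlib.decompress(blob))
```

Because both classes satisfy the same protocol, a storage backend written against `KVSerde` would not need to know whether compression is applied. A lossy transformation such as token dropping could be added the same way, as another wrapper around an inner serde.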
LMCache is becoming an integral layer in the LLM inference ecosystem, with community-driven integrations with serving engines, inference frameworks, hardware vendors, storage systems, and infrastructure providers.
Getting Started
To use LMCache, simply install lmcache from your package manager, e.g. pip:
pip install lmcache
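After installing, a quick way to confirm that the package is visible to the current Python environment is to query its installed version through the standard library. This is a small sketch that assumes the distribution name matches the `pip install` command above; `installed_version` is a hypothetical helper, not part of LMCache.

```python
from __future__ import annotations

from importlib import metadata


def installed_version(package: str = "lmcache") -> str | None:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("lmcache is not installed in this environment")
    else:
        print(f"lmcache {version} is installed")
```

Running this with the same interpreter that the serving engine uses helps catch the common case where the package was installed into a different virtual environment.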
For more setup options and examples, see the LMCache documentation.
Contributing
We welcome and value contributions and collaborations. Join us in improving LMCache. Check out the Contributing Guide or join our Slack community to get started.
Adoption and Partnerships
LMCache has a growing community of developers, researchers, industry adopters, and partners building the next generation of efficient LLM inference systems.
As an independent open-source project, LMCache is becoming the de facto standard for KV cache management in LLM inference. Its continued development and community work are supported in part by Tensormesh.
Citation
LMCache builds on research in KV cache management, including cache reuse, offloading, compression, and serving optimization. If you use LMCache in your research, please cite the LMCache paper and related work.
@article{cheng2025lmcache,
title={LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference},
author={Cheng, Yihua and Liu, Yuhan and Yao, Jiayi and An, Yuwei and Chen, Xiaokun and Feng, Shaoting and Huang, Yuyang and Shen, Samuel and Du, Kuntai and Jiang, Junchen},
journal={arXiv preprint arXiv:2510.09665},
year={2025}
}
Related papers
@inproceedings{liu2024cachegen,
title={Cachegen: Kv cache compression and streaming for fast large language model serving},
author={Liu, Yuhan and Li, Hanchen and Cheng, Yihua and Ray, Siddhant and Huang, Yuyang and Zhang, Qizheng and Du, Kuntai and Yao, Jiayi and Lu, Shan and Ananthanarayanan, Ganesh and others},
booktitle={Proceedings of the ACM SIGCOMM 2024 Conference},
pages={38--56},
year={2024}
}
@inproceedings{yao2025cacheblend,
title={Cacheblend: Fast large language model serving for rag with cached knowledge fusion},
author={Yao, Jiayi and Li, Hanchen and Liu, Yuhan and Ray, Siddhant and Cheng, Yihua and Zhang, Qizheng and Du, Kuntai and Lu, Shan and Jiang, Junchen},
booktitle={Proceedings of the twentieth European conference on computer systems},
pages={94--109},
year={2025}
}
License
The LMCache codebase is licensed under Apache License 2.0. See the LICENSE file for details.