@tom_doerr: Runs 70B LLMs on a single 4GB GPU https://github.com/lyogavin/airllm

X AI KOLs Timeline 2026/05/13 04:01 Tools

Summary

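AirLLM is an open-source tool that optimizes inference memory usage so that 70B large language models can run on a single 4GB GPU without quantization. It also supports running 405B models on 8GB of VRAM.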
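A rough back-of-the-envelope check of these figures, as a sketch only. It assumes, as the README's note about splitting the model layer by layer suggests, that roughly one transformer layer's weights need to sit on the GPU at a time. The layer counts (80 for a Llama 2 70B model, 126 for Llama 3.1 405B), the fp16 weight size, and the helper `per_layer_gb` are assumptions introduced here, not values taken from AirLLM. The estimate ignores embeddings, activations, and the KV cache.

```python
def per_layer_gb(n_params: int, n_layers: int, bytes_per_param: int = 2) -> float:
    """Approximate size of one transformer layer's weights, in GB (1e9 bytes).

    Hypothetical helper: spreads all parameters evenly over the layers,
    which overstates the per-layer size slightly (embeddings are included).
    """
    return n_params * bytes_per_param / n_layers / 1e9


# Assumed figures: published layer counts, 2 bytes per weight (fp16).
llama2_70b = per_layer_gb(70_000_000_000, 80)
llama31_405b = per_layer_gb(405_000_000_000, 126)

print(f"70B:  whole model ~140 GB, one layer ~{llama2_70b:.2f} GB")
print(f"405B: whole model ~810 GB, one layer ~{llama31_405b:.2f} GB")
```

Under these assumptions a single layer comes to about 1.75 GB for the 70B model and about 6.43 GB for the 405B model, which is consistent with the 4GB and 8GB claims even though the full fp16 weights are far larger.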

Runs 70B LLMs on single 4GB GPU https://t.co/TaxtusXkb6 https://t.co/UFoIV1UkyC


Cached at: 2026/05/13 18:24

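# lyogavin/airllm

Source: https://github.com/lyogavin/airllm

Quickstart | Configurations | MacOS | Example Notebooks | FAQ

AirLLM optimizes inference memory usage, allowing 70B large language models to run inference on a single 4GB GPU card without quantization, distillation, or pruning. You can now also run Llama3.1 405B on 8GB of VRAM.

* Downloads (https://pepy.tech/project/airllm)
* Code License (https://github.com/LianjiaTech/BELLE/blob/main/LICENSE)
* Generic badge (https://static.aicompose.cn/static/wecom_barcode.png?t=1671918938)
* Discord (https://discord.gg/2xffU5sn)
* PyPI - AirLLM (https://pypi.org/project/airllm/)
* Website (https://medium.com/@lyo.gavin)
* Website (https://gavinliblog.com)
* Support me on Patreon (https://patreon.com/gavinli)
* GitHub Sponsors (https://github.com/sponsors/lyogavin)

## AI Agents Recommendation:

* Best AI game sprite generator (https://godmodeai.co)
* Best AI facial expression editor (https://crazyfaceai.com)

## Updates

* [2024/08/20] v2.11.0: Support for Qwen2.5.
* [2024/08/18] v2.10.1: Support for CPU inference. Support for non-sharded models. Thanks @NavodPeiris for the great work!
* [2024/07/30] Support for Llama3.1 405B (example notebook (https://colab.research.google.com/github/lyogavin/airllm/blob/main/air_llm/examples/run_llama3.1_405B.ipynb)). Support for 8bit/4bit quantization.
* [2024/04/20] AirLLM now supports Llama3 natively. Run Llama3 70B on a single 4GB GPU.
* [2023/12/25] v2.8.2: Support for running 70B large language models on MacOS.
* [2023/12/20] v2.7: Support for AirLLMMixtral.
* [2023/12/20] v2.6: Added AutoModel, which detects the model type automatically, so you no longer need to provide the model class to initialize a model.
* [2023/12/18] v2.5: Added prefetching to overlap model loading and compute. 10% speed improvement.
* [2023/12/03] Added support for ChatGLM, QWen, Baichuan, Mistral, and InternLM!
* [2023/12/02] Added support for safetensors. The top 10 models on the Open LLM Leaderboard are now all supported.
* [2023/12/01] airllm 2.0. Support for compression: 3x run-time speedup!
* [2023/11/20] airllm initial version!

## Star History

Star History Chart (https://star-history.com/#lyogavin/airllm&Timeline)

## Table of Contents

* Quickstart
* Model Compression
* Configurations
* Run on MacOS
* Example Notebooks
* Supported Models
* Acknowledgement
* FAQ

## Quickstart

### 1. Install the package

First, install the airllm pip package.

```bash
pip install airllm
```

### 2. Inference

Then initialize AirLLMLlama2 (or, as in the example below, AutoModel, which detects the model class automatically), passing the Hugging Face repo ID of the model or its local path. Inference then works much like it does with a regular transformer model. (You can also specify where to save the split, layer-wise model with layer_shards_saving_path when initializing the model.)

```python
from airllm import AutoModel

MAX_LENGTH = 128

# Load by Hugging Face model repo ID:
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct")

# ...or by local model path:
#model = AutoModel.from_pretrained("/home/ubuntu/.cache/huggingface/hub/models--garage-bAInd--Platypus2-70B-instruct/snapshots/b585e74bcaae02e52665d9ac6d23f4d0dbc81a0f")

input_text = [
    'What is the capital of United States?',
    #'I like',
]

# Tokenize the prompt; padding is off because there is a single input.
input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=False)

# Generate on the GPU, then decode the first returned sequence.
generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True)

output = model.tokenizer.decode(generation_output.sequences[0])

print(output)
```

Note: During inference, the original model is first split and saved layer by layer. Make sure the Hugging Face cache directory has enough disk space.

## Model Compression - 3x Inference Speedup!

We just added model compression based on block-wise quantization. It can further speed up inference by up to 3x, with almost negligible accuracy loss! (For more performance evaluation, and for why block-wise quantization is used, see this paper (https://arxiv.org/abs/2212.09720).)

#### How to enable model compression speedup:

* Step 1: Make sure bitsandbytes (https://github.com/TimDettmers/bitsandbytes) is installed: pip install -U bitsandbytes
* Step 2: Make sure the airllm version is later than 2.0.0: pip install -U airllm
* Step 3: When initializing the model, pass the compression argument ('4bit' or '8bit'):

```python
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct",
                     compression='4bit' # specify '8bit' for 8-bit block-wise quantization
                    )
```

#### What is the difference between model compression and quantization?

Quantization normally has to quantize both weights and activations to really speed things up, which makes it harder to maintain accuracy and to avoid the impact of outliers across all kinds of inputs. In our case the bottleneck is mainly disk loading, so we only need to make the loaded model smaller. That means we can quantize the weights alone, which makes accuracy easier to preserve.

## Configurations

The following configurations are supported when initializing the model:

* compression: supported options are 4bit and 8bit, for 4-bit or 8-bit block-wise quantization. Defaults to None (no compression).
* profiling_mode: supported option is True, which outputs time consumption. Defaults to False.
* layer_shards_saving_path: optionally, another path in which to save the split model.
* hf_token: a Hugging Face token, which you can provide when downloading gated models such as meta-llama/Llama-2-7b-hf.
* prefetching: overlaps model loading and compute. Enabled by default. Currently only AirLLMLlama2 supports this.
* delete_original: if you are short on disk space, set delete_original to true to delete the originally downloaded Hugging Face model and keep only the transformed one, saving half the disk space.

## MacOS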
Just install airllm and run the code the same way as on Linux. See Quickstart for details.

* Make sure mlx (https://github.com/ml-explore/mlx?tab=readme-ov-file#installation) and torch are installed.
* You probably need to install a native build of Python; see more here (https://stackoverflow.com/a/65432861/21230266).
* Only Apple silicon (https://support.apple.com/en-us/HT211814) is supported.

Example [python notebook] (https://github.com/lyogavin/airllm/blob/main/air_llm/examples/run_on_macos.ipynb)

## Example Python Notebook

Example Colab notebooks are provided in the repository.

#### Examples for other models (ChatGLM, QWen, Baichuan, Mistral, etc.):

* ChatGLM:

```python
from airllm import AutoModel

MAX_LENGTH = 128
model = AutoModel.from_pretrained("THUDM/chatglm3-6b-base")

input_text = ['What is the capital of China?',]

input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=True)

generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=5,
    use_cache=True,
    return_dict_in_generate=True)

model.tokenizer.decode(generation_output.sequences[0])
```

* QWen:

```python
from airllm import AutoModel

MAX_LENGTH = 128
model = AutoModel.from_pretrained("Qwen/Qwen-7B")

input_text = ['What is the capital of China?',]

input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH)

generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=5,
    use_cache=True,
    return_dict_in_generate=True)

model.tokenizer.decode(generation_output.sequences[0])
```

* Baichuan, InternLM, Mistral, etc.:

```python
from airllm import AutoModel

MAX_LENGTH = 128
model = AutoModel.from_pretrained("baichuan-inc/Baichuan2-7B-Base")
#model = AutoModel.from_pretrained("internlm/internlm-20b")
#model = AutoModel.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

input_text = ['What is the capital of China?',]

input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH)

generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=5,
    use_cache=True,
    return_dict_in_generate=True)

model.tokenizer.decode(generation_output.sequences[0])
```

#### To request support for other models: see here (https://docs.google.com/forms/d/e/1FAIpQLSe0Io9ANMT964Zi-OQOq1TJmnvP-G3_ZgQDhP7SatN0IEdbOg/viewform?usp=sf_link)

## Acknowledgement

Much of this code is based on SimJeg's great work in the Kaggle exam competition. Big shoutout to SimJeg: GitHub account @SimJeg (https://github.com/SimJeg), the code on Kaggle (https://www.kaggle.com/code/simjeg/platypus2-70b-with-wikipedia-rag), and the associated discussion (https://www.kaggle.com/competitions/kaggle-llm-science-exam/discussion/446414).

## FAQ

### 1. MetadataIncompleteBuffer

safetensors_rust.SafetensorError: Error while deserializing header: MetadataIncompleteBuffer

If you run into this error, the most likely cause is that you have run out of disk space. Splitting the model is very disk-intensive. See this (https://huggingface.co/TheBloke/guanaco-65B-GPTQ/discussions/12). You may need to extend your disk space, clear the Hugging Face .cache (https://huggingface.co/docs/datasets/cache), and rerun.

### 2. ValueError: max() arg is an empty sequence

Most likely you are loading a QWen or ChatGLM model with the Llama2 class. Try the following.

For QWen models:

```python
from airllm import AutoModel #<----- use AutoModel instead of AirLLMLlama2
AutoModel.from_pretrained(...)
```

For ChatGLM models:

```python
from airllm import AutoModel #<----- use AutoModel instead of AirLLMLlama2
AutoModel.from_pretrained(...)
```

### 3. 401 Client Error….Repo model … is gated.

Some models are gated and require a Hugging Face API token. You can provide hf_token:

```python
model = AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf",
    hf_token='HF_API_TOKEN')  # replace with your own Hugging Face API token
```

### 4. ValueError: Asking to pad but the tokenizer does not have a padding token.
Some models' tokenizers have no padding token. You can either set a padding token or simply turn off padding:

```python
input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=False  #<----------- turn off padding
)
```

## Citing AirLLM

If you find AirLLM useful in your research and wish to cite it, please use the following BibTex entry:

```
@software{airllm2023,
  author = {Gavin Li},
  title = {AirLLM: scaling large language models on low-end commodity computers},
  url = {https://github.com/lyogavin/airllm/},
  version = {0.0},
  year = {2023},
}
```

## Contribution

Contributions, ideas, and discussions are welcome! If you find it useful, please ⭐ or buy me a coffee! 🙏 “Buy Me A Coffee” (https://bmc.link/lyogavinQ)
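To illustrate the weight-only, block-wise quantization idea described under Model Compression, here is a minimal pure-Python sketch. It is not AirLLM's or bitsandbytes' implementation: the function names, the linear absmax scheme, and the block size of 4 are all assumptions chosen for readability. Each block of weights gets its own scale, so an outlier in one block does not wipe out the precision of the others.

```python
from typing import List, Tuple


def quantize_blockwise(weights: List[float], block_size: int = 4) -> Tuple[List[int], List[float]]:
    """Hypothetical helper: absmax int8 quantization with one scale per block."""
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        # Fall back to 1.0 so an all-zero block does not divide by zero.
        absmax = max(abs(w) for w in block) or 1.0
        scales.append(absmax / 127.0)
        codes.extend(round(w / absmax * 127) for w in block)
    return codes, scales


def dequantize_blockwise(codes: List[int], scales: List[float], block_size: int = 4) -> List[float]:
    """Recover approximate weights by multiplying each code by its block's scale."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


# The second block contains an outlier (5.0); the first block has only small values.
weights = [0.01, -0.02, 0.03, 0.015, 5.0, -0.02, 0.01, 0.03]
codes, scales = quantize_blockwise(weights, block_size=4)
restored = dequantize_blockwise(codes, scales, block_size=4)
print(codes)
print(restored)
```

With a single scale for all eight values (`block_size=8`), the first weight, 0.01, quantizes to code 0 and is lost; with per-block scales it survives with a small error. Storing 8-bit codes plus one scale per block also means fewer bytes to read from disk than fp16 weights, which is the bottleneck the README identifies.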


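Similar articles

@tom_doerr: Runs 35B models on a Mac with 16GB of RAM https://github.com/walter-grace/mac-code…

Local LLM autocomplete and agentic coding on a single 16GB GPU + 64GB RAM

@DivyanshT91162: Local LLMs have entered a whole new phase. This Hugging Face release is insane: “gpt-oss-20b-tq3” This is an official 20B+ …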

vllm-project/vllm v0.19.1

@oliviscusAI: Someone just built a tool that tells you exactly which LLMs can run on your hardware. It scans your RAM, CPU, and GPU……
