lyogavin/airllm

GitHub Trending (daily) · Tools

Summary

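AirLLM is an open-source library that runs 70B large language models on a single 4GB GPU, and Llama 3.1 405B on 8GB of VRAM, without quantization, distillation, or pruning. This substantially lowers the hardware bar for local LLM inference.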

AirLLM: 70B inference on a single 4GB GPU


Cached at: 2026/06/03 21:37

lyogavin/airllm

Source: https://github.com/lyogavin/airllm

Quickstart | Configurations | MacOS | Example notebooks | FAQ

AirLLM optimizes inference memory usage, allowing 70B large language models to run inference on a single 4GB GPU card without quantization, distillation, or pruning. You can now also run Llama 3.1 405B on 8GB of VRAM.

* Downloads (https://pepy.tech/project/airllm)
* Code License (https://github.com/LianjiaTech/BELLE/blob/main/LICENSE)
* Generic badge (https://static.aicompose.cn/static/wecom_barcode.png?t=1671918938)
* Discord (https://discord.gg/2xffU5sn)
* PyPI - AirLLM (https://pypi.org/project/airllm/)
* Website (https://medium.com/@lyo.gavin)
* Website (https://gavinliblog.com)
* Support me on Patreon (https://patreon.com/gavinli)
* GitHub Sponsors (https://github.com/sponsors/lyogavin)

## AI Agents Recommendation:

* Best AI Game Sprite Generator (https://godmodeai.co)
* Best AI Facial Expression Editor (https://crazyfaceai.com)

## Updates

* [2024/08/20] v2.11.0: Support Qwen2.5.
* [2024/08/18] v2.10.1: Support CPU inference. Support non-sharded models. Thanks @NavodPeiris for the great work!
* [2024/07/30] Support Llama3.1 405B (example notebook (https://colab.research.google.com/github/lyogavin/airllm/blob/main/air_llm/examples/run_llama3.1_405B.ipynb)). Support 8bit/4bit quantization.
* [2024/04/20] AirLLM supports Llama3 natively. Run Llama3 70B on a single 4GB GPU.
* [2023/12/25] v2.8.2: Support running 70B large language models on MacOS.
* [2023/12/20] v2.7: Support AirLLMMixtral.
* [2023/12/20] v2.6: Added AutoModel, which detects the model type automatically, so there is no need to provide a model class to initialize the model.
* [2023/12/18] v2.5: Added prefetching to overlap model loading and compute. 10% speed improvement.
* [2023/12/03] Added support for ChatGLM, QWen, Baichuan, Mistral, and InternLM!
* [2023/12/02] Added support for safetensors. Now supports the top 10 models on the Open LLM Leaderboard.
* [2023/12/01] AirLLM 2.0. Support for compression: 3x runtime speedup!
* [2023/11/20] AirLLM initial version!

## Star History

Star History Chart (https://star-history.com/#lyogavin/airllm&Timeline)

## Table of Contents

* Quick Start
* Model Compression
* Configurations
* Run on MacOS
* Example notebooks
* Supported Models
* Acknowledgement
* FAQ

## Quick Start

### 1. Install the package

First, install the airllm pip package.

```bash
pip install airllm
```

### 2. Inference

Then initialize the model with `AutoModel.from_pretrained`, passing the Hugging Face repo ID of the model or its local path. Since v2.6, AutoModel detects the model type, so you no longer need to pick a class such as AirLLMLlama2 yourself. Inference then works much like it does with a regular transformer model. (You can also pass `layer_shards_saving_path` at initialization to choose where the split, layer-wise model is saved.)

```python
from airllm import AutoModel

MAX_LENGTH = 128

# You can use a Hugging Face model repo ID:
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct")

# or the model's local path...
#model = AutoModel.from_pretrained("/home/ubuntu/.cache/huggingface/hub/models--garage-bAInd--Platypus2-70B-instruct/snapshots/b585e74bcaae02e52665d9ac6d23f4d0dbc81a0f")

input_text = [
    'What is the capital of United States?',
    #'I like',
]

# Tokenize the prompt; padding is off because there is a single input.
input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=False)

# Generate up to 20 new tokens on the GPU.
generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True)

output = model.tokenizer.decode(generation_output.sequences[0])

print(output)
```

Note: during inference, the original model is first decomposed and saved layer by layer. Make sure the Hugging Face cache directory has enough disk space.

## Model Compression - 3x Inference Speed Up!

We just added model compression based on block-wise quantization. It can speed up inference by up to 3x, with almost negligible accuracy loss! (For more performance evaluation and the reasons for using block-wise quantization, see this paper (https://arxiv.org/abs/2212.09720).)

#### How to enable model compression speedup:

* Step 1: make sure bitsandbytes (https://github.com/TimDettmers/bitsandbytes) is installed: `pip install -U bitsandbytes`
* Step 2: make sure the airllm version is later than 2.0.0: `pip install -U airllm`
* Step 3: when initializing the model, pass the `compression` argument ('4bit' or '8bit'):

```python
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct",
                     compression='4bit' # specify '8bit' for 8-bit block-wise quantization
                    )
```

#### What is the difference between model compression and quantization?

Quantization normally has to quantize both weights and activations to deliver a real speedup. That makes it harder to maintain accuracy and to avoid the impact of outliers across all kinds of inputs.

In our case the bottleneck is mainly disk loading, so we only need to make the loaded model smaller. We therefore quantize only the weights, which makes accuracy easier to preserve.

## Configurations

When initializing the model, the following configurations are supported:

* compression: supported options are 4bit and 8bit, for 4-bit or 8-bit block-wise quantization. Defaults to None, meaning no compression.
* profiling_mode: supported option is True, which outputs time consumption. Defaults to False.
* layer_shards_saving_path: optionally another path in which to save the split model.
* hf_token: a Hugging Face token, needed when downloading gated models such as meta-llama/Llama-2-7b-hf.
* prefetching: prefetch to overlap model loading and compute. On by default. Currently only AirLLMLlama2 supports this.
* delete_original: if disk space is tight, set delete_original to true to delete the originally downloaded Hugging Face model and keep only the transformed one, saving half of the disk space.

## MacOS

Just install airllm and run the code the same way as on Linux. See Quick Start for more.

* Make sure mlx (https://github.com/ml-explore/mlx?tab=readme-ov-file#installation) and torch are installed.
* You may need to install a native Python build; see here (https://stackoverflow.com/a/65432861/21230266) for more.
* Only Apple Silicon (https://support.apple.com/en-us/HT211814) is supported.

Example [python notebook] (https://github.com/lyogavin/airllm/blob/main/air_llm/examples/run_on_macos.ipynb)

## Example Python Notebook

Example Colab notebooks are in the repository under `air_llm/examples`.

#### Examples for other models (ChatGLM, QWen, Baichuan, Mistral, etc.):

* ChatGLM:

```python
from airllm import AutoModel
MAX_LENGTH = 128
model = AutoModel.from_pretrained("THUDM/chatglm3-6b-base")
input_text = ['What is the capital of China?',]
input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=True)
generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=5,
    use_cache=True,
    return_dict_in_generate=True)
model.tokenizer.decode(generation_output.sequences[0])
```

* QWen:

```python
from airllm import AutoModel
MAX_LENGTH = 128
model = AutoModel.from_pretrained("Qwen/Qwen-7B")
input_text = ['What is the capital of China?',]
input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH)
generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=5,
    use_cache=True,
    return_dict_in_generate=True)
model.tokenizer.decode(generation_output.sequences[0])
```

* Baichuan, InternLM, Mistral, etc.:

```python
from airllm import AutoModel
MAX_LENGTH = 128
model = AutoModel.from_pretrained("baichuan-inc/Baichuan2-7B-Base")
#model = AutoModel.from_pretrained("internlm/internlm-20b")
#model = AutoModel.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
input_text = ['What is the capital of China?',]
input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH)
generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=5,
    use_cache=True,
    return_dict_in_generate=True)
model.tokenizer.decode(generation_output.sequences[0])
```

#### To request support for other models:

Please use this form (https://docs.google.com/forms/d/e/1FAIpQLSe0Io9ANMT964Zi-OQOq1TJmnvP-G3_ZgQDhP7SatN0IEdbOg/viewform?usp=sf_link).

## Acknowledgement

A lot of the code is based on SimJeg's great work in the Kaggle LLM Science Exam competition. Big shoutout to SimJeg: GitHub account @SimJeg (https://github.com/SimJeg), the code on Kaggle (https://www.kaggle.com/code/simjeg/platypus2-70b-with-wikipedia-rag), and the associated discussion (https://www.kaggle.com/competitions/kaggle-llm-science-exam/discussion/446414).

## FAQ

### 1. MetadataIncompleteBuffer

`safetensors_rust.SafetensorError: Error while deserializing header: MetadataIncompleteBuffer`

If you run into this error, the most likely cause is that you have run out of disk space. Splitting the model is very disk-consuming. See this discussion (https://huggingface.co/TheBloke/guanaco-65B-GPTQ/discussions/12). You may need to extend your disk space, clear the Hugging Face cache (https://huggingface.co/docs/datasets/cache), and rerun.

### 2. ValueError: max() arg is an empty sequence

Most likely you are loading a QWen or ChatGLM model with the Llama2 class. Try the following.

For QWen models:

```python
from airllm import AutoModel #<----- instead of AirLLMLlama2
AutoModel.from_pretrained(...)
```

For ChatGLM models:

```python
from airllm import AutoModel #<----- instead of AirLLMLlama2
AutoModel.from_pretrained(...)
```

### 3. 401 Client Error....Repo model ... is gated.

Some models are gated and need a Hugging Face API token. You can provide `hf_token`:

```python
model = AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf",
                                  hf_token='HF_API_TOKEN')  # replace with your own token
```

### 4. ValueError: Asking to pad but the tokenizer does not have a padding token.

Some models' tokenizers have no padding token, so you can either set a padding token or simply turn the padding config off:

```python
input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=False  #<----------- turn off padding
)
```

## Citing AirLLM

If you find AirLLM useful in your research and wish to cite it, please use the following BibTeX entry:

```
@software{airllm2023,
  author = {Gavin Li},
  title = {AirLLM: scaling large language models on low-end commodity computers},
  url = {https://github.com/lyogavin/airllm/},
  version = {0.0},
  year = {2023},
}
```

## Contribution

Contributions, ideas, and discussions are welcome! If you find it useful, please ⭐ or buy me a coffee! 🙏

"Buy Me A Coffee" (https://bmc.link/lyogavinQ)
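To make the weight-only, block-wise quantization idea from the model compression section more concrete, here is a minimal sketch in plain Python. It is not AirLLM's or bitsandbytes' implementation: the absmax int8 scheme, the block size of 4, and the function names `quantize_blockwise` and `dequantize_blockwise` are all assumptions chosen for illustration. The point it shows is that giving each block its own scale keeps an outlier in one block from degrading the precision of small weights in another.

```python
def quantize_blockwise(weights, block_size=4):
    """Quantize a flat list of floats to int8 codes, one absmax scale per block.

    Illustrative only: real libraries use their own formats and GPU kernels.
    """
    codes, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        # Scale so the largest magnitude in the block maps to 127;
        # fall back to 1.0 for an all-zero block to avoid dividing by zero.
        scale = max(abs(w) for w in block) / 127 or 1.0
        scales.append(scale)
        codes.extend(round(w / scale) for w in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size=4):
    """Recover approximate floats from int8 codes and per-block scales."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


# The first block holds small weights; the second block holds outliers.
weights = [0.02, -0.01, 0.03, 0.00, 5.0, -4.0, 2.5, 1.0]
codes, scales = quantize_blockwise(weights)
restored = dequantize_blockwise(codes, scales)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes)
print(max_error)
```

Only the weights are touched here and activations stay in full precision, which matches the README's argument that the goal is a smaller model to load from disk rather than faster arithmetic.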
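The README notes that the model is first split and saved layer by layer, and that prefetching overlaps model loading with compute. The sketch below illustrates that overlap pattern with the standard library only. It is a toy under stated assumptions, not AirLLM's code: `load_layer` and `apply_layer` are hypothetical stand-ins for reading one layer's weights from disk and running it on the GPU, and each "layer" just multiplies the activations by a constant.

```python
from concurrent.futures import ThreadPoolExecutor


def load_layer(index):
    """Hypothetical stand-in for reading one layer's weights from disk."""
    return {"index": index, "scale": index + 1}


def apply_layer(layer, activations):
    """Hypothetical stand-in for running one layer on the GPU."""
    return [x * layer["scale"] for x in activations]


def run_layers(num_layers, activations):
    """Run layers in order, loading layer i+1 in the background while layer i computes."""
    if num_layers == 0:
        return activations
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_layer, 0)
        for i in range(num_layers):
            layer = pending.result()  # wait until the current layer is loaded
            if i + 1 < num_layers:
                # Start loading the next layer before computing this one.
                pending = pool.submit(load_layer, i + 1)
            activations = apply_layer(layer, activations)
            del layer  # drop the reference so this layer's weights can be freed
    return activations


print(run_layers(3, [1.0, 2.0]))  # layer scales 1 * 2 * 3 = 6, so [6.0, 12.0]
```

At any moment at most two layers are held in memory (the one computing and the one being prefetched), which is the property that lets peak memory stay far below the size of the full model.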
