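@NFTCPS: Running a 70B model on 4GB of VRAM? It actually works! AirLLM pulls a clever trick, layer-wise inference: instead of loading the whole model into VRAM at once, it loads one layer at a time and discards each layer as soon as it has been computed, squeezing a giant model onto a small card. Best of all, it is 100% open source and free to use. https://github.com/0xSo…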
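Summary
AirLLM is a fully open-source tool that uses layer-wise inference (loading layers one at a time and freeing VRAM immediately after each) to run 70B large language models on a GPU with only 4GB of VRAM, without quantization, distillation, or pruning. It also supports running Llama3.1 405B on 8GB of VRAM.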
Cached: 2026/06/03 09:46
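Running a 70B model on 4GB of VRAM? It actually works!
AirLLM pulls a clever trick, layer-wise inference: instead of loading the whole model into VRAM at once, it loads one layer at a time and discards each layer as soon as it has been computed, squeezing a giant model onto a small card.
Best of all, it is 100% open source and free to use ⚠️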
🔗 https://t.co/gpiHYFwt69 https://t.co/YzPnYTCHGz
0xSojalSec/airllm
Source: https://github.com/0xSojalSec/airllm

Quickstart | Configurations | MacOS | Example notebooks | FAQ
AirLLM optimizes inference memory usage, allowing 70B large language models to run inference on a single 4GB GPU card without quantization, distillation, or pruning. You can now also run Llama3.1 405B on 8GB of VRAM.
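A rough back-of-the-envelope calculation shows why holding one layer at a time can fit on a small card. The sketch below assumes 16-bit weights and an even split across the 80 transformer blocks of a Llama 2 70B model. It ignores embeddings, activations, and the KV cache, so the numbers are only indicative.

```python
# Rough estimate only: real layers are not perfectly equal in size, and
# embeddings, activations, and the KV cache are ignored here.
PARAMS = 70e9          # parameters in a 70B model
BYTES_PER_PARAM = 2    # assumption: fp16 / bf16 weights
NUM_LAYERS = 80        # transformer blocks in Llama 2 70B


def weight_footprint_gb(params, bytes_per_param):
    """Size of the raw weights in gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


total_gb = weight_footprint_gb(PARAMS, BYTES_PER_PARAM)
per_layer_gb = total_gb / NUM_LAYERS

print(f"whole model: {total_gb:.0f} GB, one layer: {per_layer_gb:.2f} GB")
```

Under these assumptions the full model needs about 140 GB for weights, while a single layer needs under 2 GB, which is what makes a 4GB card plausible.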
Updates
[2024/08/20] v2.11.0: Support Qwen2.5
[2024/08/18] v2.10.1: Support CPU inference. Support non-sharded models. Thanks @NavodPeiris for the great work!
[2024/07/30] Support Llama3.1 405B (example notebook). Support 8bit/4bit quantization.
[2024/04/20] AirLLM supports Llama3 natively already. Run Llama3 70B on 4GB single GPU.
[2023/12/25] v2.8.2: Support MacOS running 70B large language models.
[2023/12/20] v2.7: Support AirLLMMixtral.
[2023/12/20] v2.6: Added AutoModel, automatically detect model type, no need to provide model class to initialize model.
[2023/12/18] v2.5: added prefetching to overlap the model loading and compute. 10% speed improvement.
[2023/12/03] added support of ChatGLM, QWen, Baichuan, Mistral, InternLM!
[2023/12/02] added support for safetensors. Now support all top 10 models in open llm leaderboard.
[2023/12/01] airllm 2.0. Support compressions: 3x run time speed up!
[2023/11/20] airllm Initial version!
Table of Contents
- Quick start
- Model Compression
- Configurations
- Run on MacOS
- Example notebooks
- Supported Models
- Acknowledgement
- FAQ
Quickstart
1. Install package
First, install the airllm pip package.
pip install airllm
2. Inference
Then initialize the model, passing in the Hugging Face repo ID of the model being used or its local path. The example below uses AutoModel, which detects the model type automatically. Inference can then be performed much as with a regular transformer model.
(You can also specify where to save the split, layer-wise model through layer_shards_saving_path when initializing the model.)
from airllm import AutoModel
MAX_LENGTH = 128
# could use hugging face model repo id:
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct")
# or use model's local path...
#model = AutoModel.from_pretrained("/home/ubuntu/.cache/huggingface/hub/models--garage-bAInd--Platypus2-70B-instruct/snapshots/b585e74bcaae02e52665d9ac6d23f4d0dbc81a0f")
input_text = [
    'What is the capital of United States?',
    #'I like',
]
input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=False)
generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True)
output = model.tokenizer.decode(generation_output.sequences[0])
print(output)
Note: During inference, the original model will first be decomposed and saved layer-wise. Please ensure there is sufficient disk space in the huggingface cache directory.
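The split-then-stream idea described in the note can be illustrated with a toy example. This is a minimal sketch, not AirLLM's actual implementation. The "layers" are small dictionaries rather than transformer blocks, and the helper names (`split_into_layer_files`, `apply_layer`, `layered_forward`) are hypothetical.

```python
import os
import pickle
import tempfile


def split_into_layer_files(layers, directory):
    """Save each layer's weights to its own file and return the paths in order."""
    paths = []
    for i, weights in enumerate(layers):
        path = os.path.join(directory, f"layer_{i}.pkl")
        with open(path, "wb") as f:
            pickle.dump(weights, f)
        paths.append(path)
    return paths


def apply_layer(weights, x):
    """Toy layer: an elementwise scale and shift standing in for a transformer block."""
    return [weights["scale"] * v + weights["shift"] for v in x]


def layered_forward(paths, x):
    """Run the forward pass while holding only one layer's weights at a time."""
    for path in paths:
        with open(path, "rb") as f:
            weights = pickle.load(f)   # load one layer from disk
        x = apply_layer(weights, x)    # compute with it
        del weights                    # drop it before loading the next layer
    return x


toy_model = [
    {"scale": 2.0, "shift": 1.0},
    {"scale": 0.5, "shift": 0.0},
    {"scale": 1.0, "shift": -1.0},
]
with tempfile.TemporaryDirectory() as tmp:
    layer_paths = split_into_layer_files(toy_model, tmp)
    result = layered_forward(layer_paths, [1.0, 2.0])
print(result)
```

Peak memory in this pattern is bounded by the largest single layer plus the activations. The price is a disk read per layer on every forward pass, which is why the split copy needs its own disk space.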
Model Compression - 3x Inference Speed Up!
We just added model compression based on block-wise quantization. It can speed up inference by up to 3x with almost negligible accuracy loss! (See this paper for more performance evaluation and for why we use block-wise quantization.)

How to enable model compression speed up:
- Step 1. Make sure bitsandbytes is installed:
pip install -U bitsandbytes
- Step 2. Make sure the airllm version is later than 2.0.0:
pip install -U airllm
- Step 3. When initializing the model, pass the compression argument ('4bit' or '8bit'):
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct",
    compression='4bit'  # specify '8bit' for 8-bit block-wise quantization
    )
What are the differences between model compression and quantization?
Quantization normally has to cover both weights and activations to really speed things up, which makes it harder to maintain accuracy and to avoid the impact of outliers across all kinds of inputs.
In our case the bottleneck is mainly disk loading, so we only need to make the loaded model smaller. That means we can quantize only the weights, which makes it easier to preserve accuracy.
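To make the weights-only, block-wise idea concrete, here is a minimal pure-Python sketch of absmax int8 quantization with one scale per block. It illustrates the general technique and is not the scheme bitsandbytes or AirLLM actually uses. The function names and the block size are assumptions.

```python
def quantize_blockwise(weights, block_size=4):
    """Quantize floats to int8 range, using one absmax scale per block."""
    blocks = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        # Scale so the largest magnitude in this block maps to 127.
        # An all-zero block falls back to a scale of 1.0.
        scale = max(abs(w) for w in block) / 127 or 1.0
        quantized = [round(w / scale) for w in block]
        blocks.append((scale, quantized))
    return blocks


def dequantize_blockwise(blocks):
    """Recover approximate float weights from (scale, int8 values) pairs."""
    return [scale * q for scale, values in blocks for q in values]


# The second block contains large outliers; the first contains only small values.
weights = [0.02, -0.01, 0.03, 0.0, 5.0, -4.0, 3.5, 0.1]
blocks = quantize_blockwise(weights, block_size=4)
restored = dequantize_blockwise(blocks)
errors = [abs(a - b) for a, b in zip(weights, restored)]
print(max(errors[:4]), max(errors[4:]))
```

Because each block has its own scale, the outliers in the second block do not reduce the precision of the small values in the first block. With a single global scale, the small values would be rounded much more coarsely.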
Configurations
When initializing the model, the following configurations are supported:
- compression: supported options are 4bit and 8bit, for 4-bit or 8-bit block-wise quantization; the default is None (no compression).
- profiling_mode: set to True to output time consumption; the default is False.
- layer_shards_saving_path: optionally, another path in which to save the split model.
- hf_token: a Hugging Face token can be provided here when downloading gated models such as meta-llama/Llama-2-7b-hf.
- prefetching: overlaps model loading with compute. Turned on by default. For now, only AirLLMLlama2 supports this.
- delete_original: if you are short on disk space, set delete_original to True to delete the originally downloaded Hugging Face model and keep only the transformed one, saving half of the disk space.
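The prefetching option overlaps loading the next layer with computing the current one. The sketch below shows that pattern with a background thread. It is an illustration under stated assumptions, not AirLLM's code: `load_layer` and `compute` are hypothetical stand-ins that sleep to mimic disk I/O and GPU work.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_layer(i):
    """Hypothetical stand-in for reading layer i from disk."""
    time.sleep(0.01)
    return {"index": i, "scale": i + 1}


def compute(weights, x):
    """Hypothetical stand-in for running one layer on the GPU."""
    time.sleep(0.01)
    return x * weights["scale"]


def forward_with_prefetch(num_layers, x):
    """Compute layer i while layer i + 1 is being loaded in the background."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_layer, 0)
        for i in range(num_layers):
            weights = pending.result()                    # wait for the current layer
            if i + 1 < num_layers:
                pending = pool.submit(load_layer, i + 1)  # start loading the next one
            x = compute(weights, x)                       # overlaps with the load above
    return x


out = forward_with_prefetch(4, 1)
print(out)
```

Only two layers are resident at any moment (the one in use and the one being fetched), so the memory cost of prefetching is roughly one extra layer.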
MacOS
Just install airllm and run the code the same way as on Linux. See more in Quick Start.
- Make sure you have installed mlx and torch.
- You probably need to install a native Python build; see more here.
- Only Apple silicon is supported.
Example Python notebook: https://github.com/lyogavin/airllm/blob/main/air_llm/examples/run_on_macos.ipynb
Example Python Notebook
Example colabs here:
Examples for other models (ChatGLM, QWen, Baichuan, Mistral, etc.):
- ChatGLM:
from airllm import AutoModel
MAX_LENGTH = 128
model = AutoModel.from_pretrained("THUDM/chatglm3-6b-base")
input_text = ['What is the capital of China?',]
input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=True)
generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=5,
    use_cache=True,
    return_dict_in_generate=True)
model.tokenizer.decode(generation_output.sequences[0])
- QWen:
from airllm import AutoModel
MAX_LENGTH = 128
model = AutoModel.from_pretrained("Qwen/Qwen-7B")
input_text = ['What is the capital of China?',]
input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH)
generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=5,
    use_cache=True,
    return_dict_in_generate=True)
model.tokenizer.decode(generation_output.sequences[0])
- Baichuan, InternLM, Mistral, etc:
from airllm import AutoModel
MAX_LENGTH = 128
model = AutoModel.from_pretrained("baichuan-inc/Baichuan2-7B-Base")
#model = AutoModel.from_pretrained("internlm/internlm-20b")
#model = AutoModel.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
input_text = ['What is the capital of China?',]
input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH)
generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=5,
    use_cache=True,
    return_dict_in_generate=True)
model.tokenizer.decode(generation_output.sequences[0])
To request other model support: here
Acknowledgement
A lot of the code is based on SimJeg's great work in the Kaggle exam competition. Big shoutout to SimJeg:
GitHub account @SimJeg, the code on Kaggle, the associated discussion.
FAQ
1. MetadataIncompleteBuffer
safetensors_rust.SafetensorError: Error while deserializing header: MetadataIncompleteBuffer
If you run into this error, the most likely cause is that you have run out of disk space. Splitting the model consumes a lot of disk space. See this. You may need to extend your disk space, clear the Hugging Face .cache directory, and rerun.
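One way to catch this before the split starts is to check free disk space up front. The sketch below uses only the standard library. The helper name, the assumption that the split copy is about as large as the original model, and the 10% safety margin are all illustrative, not part of AirLLM.

```python
import shutil


def enough_space_for_split(cache_dir, model_bytes, safety_margin=1.1):
    """Return True if cache_dir has room for a split copy of the model.

    Assumption: the layer-wise copy is roughly as large as the original
    weights, so we require model_bytes times a small safety margin.
    """
    free = shutil.disk_usage(cache_dir).free
    return free >= model_bytes * safety_margin


# Example: a 70B model in 16-bit precision is about 140 GB of weights.
model_bytes = 140 * 10**9
if not enough_space_for_split(".", model_bytes):
    print("Not enough free disk space for the split model; free up space first.")
```

In practice you would pass the Hugging Face cache directory, or the directory given as layer_shards_saving_path, instead of ".".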
2. ValueError: max() arg is an empty sequence
Most likely you are loading a QWen or ChatGLM model with the Llama2 class. Try the following:
For QWen model:
from airllm import AutoModel #<----- instead of AirLLMLlama2
AutoModel.from_pretrained(...)
For ChatGLM model:
from airllm import AutoModel #<----- instead of AirLLMLlama2
AutoModel.from_pretrained(...)
3. 401 Client Error….Repo model … is gated.
Some models are gated and require a Hugging Face API token. You can provide it through hf_token:
model = AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf",
    hf_token='HF_API_TOKEN')
4. ValueError: Asking to pad but the tokenizer does not have a padding token.
Some models' tokenizers don't have a padding token, so you can either set a padding token or simply turn padding off:
input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=False  #<----------- turn off padding
    )
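For the other option, setting a padding token, a common pattern with Hugging Face tokenizers is to reuse the end-of-sequence token. The helper below is a hypothetical sketch. It is demonstrated on a stand-in object so it runs without downloading a model, and it assumes the tokenizer exposes `pad_token` and `eos_token` attributes as Hugging Face tokenizers do.

```python
from types import SimpleNamespace


def ensure_pad_token(tokenizer):
    """Reuse the EOS token for padding if the tokenizer has no padding token."""
    if getattr(tokenizer, "pad_token", None) is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer


# Stand-in for a real tokenizer, for illustration only.
fake_tokenizer = SimpleNamespace(pad_token=None, eos_token="</s>")
ensure_pad_token(fake_tokenizer)
print(fake_tokenizer.pad_token)
```

With a real model this would be called as `ensure_pad_token(model.tokenizer)` before tokenizing with padding enabled.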
Citing AirLLM
If you find AirLLM useful in your research and wish to cite it, please use the following BibTeX entry:
@software{airllm2023,
author = {Gavin Li},
title = {AirLLM: scaling large language models on low-end commodity computers},
url = {https://github.com/lyogavin/airllm/},
version = {0.0},
year = {2023},
}
Contribution
Contributions, ideas, and discussions are welcome!
If you find it useful, please ⭐ or buy me a coffee! 🙏
