@NFTCPS: 4GB VRAM running 70B large model? It actually works! AirLLM did a clever trick — layered inference, not loading the whole model into VRAM at once, but layer by layer, compute and discard, squeezing the giant into a small GPU. The best part: 100% open source, freebie warning https://github.com/0xSo…
Summary
AirLLM is a fully open-source tool that uses layered inference (loading and releasing VRAM layer by layer) to enable 70B large language models to run on GPUs with only 4GB VRAM, without quantization, distillation, or pruning. It already supports running Llama3.1 405B on 8GB VRAM.
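The layer-by-layer idea can be shown with a toy example. This is a minimal sketch under stated assumptions, not AirLLM's actual implementation: the "model" is two tiny dense layers, each stored in its own JSON file, and the helper names (`save_layers`, `apply_layer`, `layered_forward`) are hypothetical. It only shows the load, compute, release loop that keeps a single layer's weights in memory at a time.

```python
import json
import os
import tempfile


def save_layers(layers, directory):
    """Write each layer's weights to its own file (a per-layer split on disk)."""
    paths = []
    for i, layer in enumerate(layers):
        path = os.path.join(directory, f"layer_{i}.json")
        with open(path, "w") as f:
            json.dump(layer, f)
        paths.append(path)
    return paths


def apply_layer(layer, x):
    """Toy layer: y = relu(W @ x + b), with W stored as a list of rows."""
    return [
        max(0.0, sum(w * v for w, v in zip(row, x)) + bias)
        for row, bias in zip(layer["W"], layer["b"])
    ]


def layered_forward(paths, x):
    """Forward pass that holds only one layer's weights in memory at a time."""
    for path in paths:
        with open(path) as f:
            layer = json.load(f)   # load this layer only
        x = apply_layer(layer, x)  # compute with it
        del layer                  # release it before loading the next one
    return x


# Hypothetical toy weights, chosen only for illustration.
layers = [
    {"W": [[1.0, 0.0], [0.0, 1.0]], "b": [1.0, -1.0]},
    {"W": [[2.0, 1.0]], "b": [0.5]},
]

with tempfile.TemporaryDirectory() as d:
    paths = save_layers(layers, d)
    result = layered_forward(paths, [1.0, 2.0])

print(result)  # [5.5]
```

The trade-off is that every forward pass re-reads the weights from disk, so peak memory drops to roughly one layer while loading time becomes the dominant cost.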
Cached at: 06/03/26, 09:46 AM
Running a 70B model on 4GB VRAM? It’s actually done! AirLLM pulled off a slick trick—layered inference, instead of loading the whole model into VRAM at once, it loads layer by layer, discards after computation, squeezing the giant into a tiny GPU card. The coolest part: 100% open source, freebie warning ⚠️ 🔗 https://t.co/gpiHYFwt69 https://t.co/YzPnYTCHGz
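A rough back-of-the-envelope check shows why the claim is plausible. The figures below are assumptions, not numbers from the project: 70 billion parameters, 2 bytes per parameter (fp16), and 80 transformer layers as in Llama 2 70B-style models. Activations, the KV cache, and the embedding and output layers are ignored.

```python
def weight_gb(num_params, bytes_per_param):
    """Size of the weights in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 70e9  # assumption: 70B parameters
NUM_LAYERS = 80    # assumption: Llama 2 70B-style layer count

total_fp16_gb = weight_gb(NUM_PARAMS, 2)      # whole model in fp16
per_layer_fp16_gb = total_fp16_gb / NUM_LAYERS  # one layer in fp16
total_4bit_gb = weight_gb(NUM_PARAMS, 0.5)    # whole model at 4 bits per weight

print(f"fp16 total:     {total_fp16_gb:.2f} GB")
print(f"fp16 per layer: {per_layer_fp16_gb:.2f} GB")
print(f"4-bit total:    {total_4bit_gb:.2f} GB")
```

Under these assumptions the full fp16 model is far larger than 4GB, but a single layer is under 2 GB, which is why holding one layer at a time can fit on a small card.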
---

# 0xSojalSec/airllm

Source: https://github.com/0xSojalSec/airllm

Quickstart | Configurations | MacOS | Example notebooks | FAQ

AirLLM optimizes inference memory usage, allowing 70B large language models to run inference on a single 4GB GPU card without quantization, distillation, or pruning. You can now also run Llama3.1 405B on 8GB of VRAM.

[Downloads](https://pepy.tech/project/airllm) | [Code License](https://github.com/LianjiaTech/BELLE/blob/main/LICENSE) | [Generic badge](https://static.aicompose.cn/static/wecom_barcode.png?t=1671918938) | [Discord](https://discord.gg/2xffU5sn) | [PyPI - AirLLM](https://pypi.org/project/airllm/) | [Website](https://medium.com/@lyo.gavin) | [Website](https://gavinliblog.com) | [Support me on Patreon](https://patreon.com/gavinli) | [GitHub Sponsors](https://github.com/sponsors/lyogavin)

## AI Agents Recommendation:

* [Best AI Game Sprite Generator](https://godmodeai.co)
* [Best AI Facial Expression Editor](https://crazyfaceai.com)

## Updates

* [2024/08/20] v2.11.0: Support Qwen2.5
* [2024/08/18] v2.10.1: Support CPU inference. Support non-sharded models. Thanks @NavodPeiris for the great work!
* [2024/07/30] Support Llama3.1 405B ([example notebook](https://colab.research.google.com/github/lyogavin/airllm/blob/main/air_llm/examples/run_llama3.1_405B.ipynb)). Support 8bit/4bit quantization.
* [2024/04/20] AirLLM supports Llama3 natively already. Run Llama3 70B on a single 4GB GPU.
* [2023/12/25] v2.8.2: Support MacOS running 70B large language models.
* [2023/12/20] v2.7: Support AirLLMMixtral.
* [2023/12/20] v2.6: Added AutoModel, which automatically detects the model type; no need to provide a model class to initialize the model.
* [2023/12/18] v2.5: Added prefetching to overlap model loading and compute. 10% speed improvement.
* [2023/12/03] Added support for ChatGLM, QWen, Baichuan, Mistral, InternLM!
* [2023/12/02] Added support for safetensors. Now supports all top 10 models in the open llm leaderboard.
* [2023/12/01] airllm 2.0. Support compressions: 3x run time speed up!
* [2023/11/20] airllm initial version!

## Star History

[Star History Chart](https://star-history.com/#lyogavin/airllm&Timeline)

## Table of Contents

* Quick start
* Model Compression
* Configurations
* Run on MacOS
* Example notebooks
* Supported Models
* Acknowledgement
* FAQ

## Quickstart

### 1. Install package

First, install the airllm pip package.

```bash
pip install airllm
```

### 2. Inference

Then initialize the model with `AutoModel`, passing in the Hugging Face repo ID of the model or its local path. Inference can then be performed much as with a regular transformers model.

(You can also specify where to save the split, layered model through `layer_shards_saving_path` when initializing the model.)

```python
from airllm import AutoModel

MAX_LENGTH = 128

# could use hugging face model repo id:
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct")

# or use model's local path...
#model = AutoModel.from_pretrained("/home/ubuntu/.cache/huggingface/hub/models--garage-bAInd--Platypus2-70B-instruct/snapshots/b585e74bcaae02e52665d9ac6d23f4d0dbc81a0f")

input_text = [
    'What is the capital of United States?',
    #'I like',
]

# tokenize the prompt; padding is off because there is a single input
input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=False)

generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True)

output = model.tokenizer.decode(generation_output.sequences[0])

print(output)
```

Note: During inference, the original model will first be decomposed and saved layer-wise. Please ensure there is sufficient disk space in the huggingface cache directory.

## Model Compression - 3x Inference Speed Up!

We just added model compression based on block-wise quantization, which can speed up inference by up to 3x with almost negligible accuracy loss!
(See more performance evaluation and why we use block-wise quantization in [this paper](https://arxiv.org/abs/2212.09720).)

#### How to enable model compression speed up:

* Step 1. Make sure you have [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) installed: `pip install -U bitsandbytes`
* Step 2. Make sure your airllm version is later than 2.0.0: `pip install -U airllm`
* Step 3. When initializing the model, pass the argument `compression` ('4bit' or '8bit'):

```python
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct",
                     compression='4bit' # specify '8bit' for 8-bit block-wise quantization
                    )
```

#### What are the differences between model compression and quantization?

Quantization normally needs to quantize both weights and activations to really speed things up, which makes it harder to maintain accuracy and avoid the impact of outliers across all kinds of inputs.

In our case the bottleneck is mainly disk loading, so we only need to make the model's loading size smaller. We therefore get to quantize only the weights, which makes it easier to preserve accuracy.

## Configurations

When initializing the model, we support the following configurations:

* `compression`: supported options are `4bit` and `8bit` for 4-bit or 8-bit block-wise quantization; the default is `None` for no compression
* `profiling_mode`: supported options are `True` to output time consumption; the default is `False`
* `layer_shards_saving_path`: optionally, another path in which to save the split model
* `hf_token`: a huggingface token can be provided here if downloading gated models like `meta-llama/Llama-2-7b-hf`
* `prefetching`: prefetching to overlap model loading and compute. Turned on by default. For now, only AirLLMLlama2 supports this.
* `delete_original`: if you don't have much disk space, you can set `delete_original` to true to delete the original downloaded hugging face model and keep only the transformed one, saving half of the disk space.
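The weight-only, block-wise idea described above can be illustrated with a small pure-Python sketch. This is an assumption-laden illustration, not the scheme bitsandbytes or AirLLM actually use: it applies simple linear absmax quantization with a hypothetical block size of 4, and the function names are made up for this example. The point is that each block gets its own scale, so an outlier in one block does not wipe out the precision of small weights in another.

```python
def quantize_blockwise(weights, block_size=4, bits=8):
    """Quantize a flat list of floats to signed integer codes, one scale per block."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    codes, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        absmax = max(abs(w) for w in block) or 1.0  # avoid a zero scale
        scale = absmax / qmax
        scales.append(scale)
        codes.extend(round(w / scale) for w in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size=4):
    """Recover approximate floats from the codes and per-block scales."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


# Hypothetical weights: one block of small values, one block with an outlier.
weights = [0.02, -0.01, 0.03, 0.00, 8.0, -4.0, 2.0, 1.0]

codes, scales = quantize_blockwise(weights)
restored = dequantize_blockwise(codes, scales)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(codes)
print(max_error)
```

Only the integer codes and one scale per block need to be stored on disk, so the amount of data read per layer shrinks, which is the bottleneck the README identifies.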
## MacOS

Just install airllm and run the code the same as on Linux. See more in Quick Start.

* Make sure you have installed [mlx](https://github.com/ml-explore/mlx?tab=readme-ov-file#installation) and torch
* You probably need to install native Python; see more [here](https://stackoverflow.com/a/65432861/21230266)
* Only [Apple silicon](https://support.apple.com/en-us/HT211814) is supported

Example [python notebook](https://github.com/lyogavin/airllm/blob/main/air_llm/examples/run_on_macos.ipynb)

## Example Python Notebook

Example colabs here:

#### Example of other models (ChatGLM, QWen, Baichuan, Mistral, etc):

* ChatGLM:

```python
from airllm import AutoModel

MAX_LENGTH = 128
model = AutoModel.from_pretrained("THUDM/chatglm3-6b-base")
input_text = ['What is the capital of China?',]
input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=True)
generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=5,
    use_cache=True,
    return_dict_in_generate=True)
model.tokenizer.decode(generation_output.sequences[0])
```

* QWen:

```python
from airllm import AutoModel

MAX_LENGTH = 128
model = AutoModel.from_pretrained("Qwen/Qwen-7B")
input_text = ['What is the capital of China?',]
input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH)
generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=5,
    use_cache=True,
    return_dict_in_generate=True)
model.tokenizer.decode(generation_output.sequences[0])
```

* Baichuan, InternLM, Mistral, etc:

```python
from airllm import AutoModel

MAX_LENGTH = 128
model = AutoModel.from_pretrained("baichuan-inc/Baichuan2-7B-Base")
#model = AutoModel.from_pretrained("internlm/internlm-20b")
#model = AutoModel.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
input_text = ['What is the capital of China?',]
input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH)
generation_output = model.generate(
    input_tokens['input_ids'].cuda(),
    max_new_tokens=5,
    use_cache=True,
    return_dict_in_generate=True)
model.tokenizer.decode(generation_output.sequences[0])
```

#### To request other model support: [here](https://docs.google.com/forms/d/e/1FAIpQLSe0Io9ANMT964Zi-OQOq1TJmnvP-G3_ZgQDhP7SatN0IEdbOg/viewform?usp=sf_link)

## Acknowledgement

A lot of the code is based on SimJeg's great work in the Kaggle exam competition. Big shoutout to SimJeg: GitHub account [@SimJeg](https://github.com/SimJeg), [the code on Kaggle](https://www.kaggle.com/code/simjeg/platypus2-70b-with-wikipedia-rag), [the associated discussion](https://www.kaggle.com/competitions/kaggle-llm-science-exam/discussion/446414).

## FAQ

### 1. MetadataIncompleteBuffer

`safetensors_rust.SafetensorError: Error while deserializing header: MetadataIncompleteBuffer`

If you run into this error, the most likely cause is that you have run out of disk space. The process of splitting the model is very disk-consuming. See [this](https://huggingface.co/TheBloke/guanaco-65B-GPTQ/discussions/12). You may need to extend your disk space, clear the huggingface [.cache](https://huggingface.co/docs/datasets/cache) and rerun.

### 2. ValueError: max() arg is an empty sequence

Most likely you are loading a QWen or ChatGLM model with the Llama2 class. Try the following:

For a QWen model:

```python
from airllm import AutoModel #<----- instead of AirLLMLlama2
AutoModel.from_pretrained(...)
```

For a ChatGLM model:

```python
from airllm import AutoModel #<----- instead of AirLLMLlama2
AutoModel.from_pretrained(...)
```

### 3. 401 Client Error….Repo model … is gated.

Some models are gated and need a huggingface API token. You can provide `hf_token`:

```python
# replace 'HF_API_TOKEN' with your own huggingface token
model = AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf", hf_token='HF_API_TOKEN')
```

### 4. ValueError: Asking to pad but the tokenizer does not have a padding token.
Some models' tokenizers don't have a padding token, so you can set a padding token or simply turn the padding config off:

```python
input_tokens = model.tokenizer(input_text,
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=False  #<----------- turn off padding
)
```

## Citing AirLLM

If you find AirLLM useful in your research and wish to cite it, please use the following BibTex entry:

```
@software{airllm2023,
  author = {Gavin Li},
  title = {AirLLM: scaling large language models on low-end commodity computers},
  url = {https://github.com/lyogavin/airllm/},
  version = {0.0},
  year = {2023},
}
```

## Contribution

Contributions, ideas and discussions are welcome!

If you find it useful, please ⭐ or buy me a coffee! 🙏

[Buy Me A Coffee](https://bmc.link/lyogavinQ)