ggml-org/llama.cpp

GitHub Trending (daily) 工具

llm-inference cpp open-source quantization local-ai cpu-inference gpu-support

摘要

llama.cpp 是一个开源 C/C++ 库，用于在本地硬件上高效运行 LLM 推理，支持多种量化方法和多后端（CPU、GPU 等）。

基于 C/C++ 的 LLM 推理

查看原文

查看缓存全文

缓存时间: 2026/05/18 12:33

ggml-org/llama.cpp

来源：https://github.com/ggml-org/llama.cpp

llama.cpp

llama 许可证：MIT (https://opensource.org/licenses/MIT)
发布版本 (https://github.com/ggml-org/llama.cpp/releases)
服务器 (https://github.com/ggml-org/llama.cpp/actions/workflows/server.yml)
宣言 (https://github.com/ggml-org/llama.cpp/discussions/205) / ggml (https://github.com/ggml-org/ggml) / 操作 (https://github.com/ggml-org/llama.cpp/blob/master/docs/ops.md)

用 C/C++ 实现 LLM 推理

近期 API 变更

libllama API 变更日志 (https://github.com/ggml-org/llama.cpp/issues/9289)
llama-server REST API 变更日志 (https://github.com/ggml-org/llama.cpp/issues/9291)

快速入门

开始使用 llama.cpp 非常简单。以下是在您的机器上安装它的几种方法：

使用 brew, nix 或 winget 安装 llama.cpp
使用 Docker 运行——请参阅我们的 Docker 文档
从发布页面 (https://github.com/ggml-org/llama.cpp/releases) 下载预编译的二进制文件
通过克隆此仓库从源码构建——请查阅我们的构建指南

安装后，您需要一个模型来使用。转到“获取和量化模型”部分了解更多信息。

示例命令：

# 使用本地模型文件  
llama-cli -m my_model.gguf  

# 或者直接从 Hugging Face 下载并运行模型  
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF  

# 启动兼容 OpenAI 的 API 服务器  
llama-server -hf ggml-org/gemma-3-1b-it-GGUF

描述

llama.cpp 的主要目标是以最少的设置和最先进的性能在广泛的硬件上（本地和云端）实现 LLM 推理。

纯 C/C++ 实现，无需任何依赖
Apple silicon 是一等公民——通过 ARM NEON、Accelerate 和 Metal 框架优化
AVX、AVX2、AVX512 和 AMX 支持 x86 架构
RVV、ZVFH、ZFH、ZICBOP 和 ZIHINTPAUSE 支持 RISC-V 架构
1.5 位、2 位、3 位、4 位、5 位、6 位和 8 位整数量化，用于更快的推理和减少内存使用
自定义 CUDA 内核，用于在 NVIDIA GPU 上运行 LLM（通过 HIP 支持 AMD GPU，通过 MUSA 支持 Moore Threads GPU）
Vulkan 和 SYCL 后端支持
CPU+GPU 混合推理，用于部分加速超出总 VRAM 容量的模型

llama.cpp 项目是开发 ggml (https://github.com/ggml-org/ggml) 库新功能的主要试验场。

模型

通常，以下基础模型的微调版本也受支持。添加新模型支持的说明：HOWTO-add-model.md

纯文本模型

LLaMA 🦙
LLaMA 2 🦙🦙
LLaMA 3 🦙🦙🦙
Mistral 7B (https://huggingface.co/mistralai/Mistral-7B-v0.1)
Mixtral MoE (https://huggingface.co/models?search=mistral-ai/Mixtral)
DBRX (https://huggingface.co/databricks/dbrx-instruct)
Jamba (https://huggingface.co/ai21labs)
Falcon (https://huggingface.co/models?search=tiiuae/falcon)
中文 LLaMA / Alpaca (https://github.com/ymcui/Chinese-LLaMA-Alpaca) 和中文 LLaMA-2 / Alpaca-2 (https://github.com/ymcui/Chinese-LLaMA-Alpaca-2)
Vigogne（法语）(https://github.com/bofenghuang/vigogne)
BERT (https://github.com/ggml-org/llama.cpp/pull/5423)
Koala (https://bair.berkeley.edu/blog/2023/04/03/koala/)
Baichuan 1 & 2 (https://huggingface.co/models?search=baichuan-inc/Baichuan) + 衍生版本 (https://huggingface.co/hiyouga/baichuan-7b-sft)
Aquila 1 & 2 (https://huggingface.co/models?search=BAAI/Aquila)
Starcoder 模型 (https://github.com/ggml-org/llama.cpp/pull/3187)
Refact (https://huggingface.co/smallcloudai/Refact-1_6B-fim)
MPT (https://github.com/ggml-org/llama.cpp/pull/3417)
Bloom (https://github.com/ggml-org/llama.cpp/pull/3553)
Yi 模型 (https://huggingface.co/models?search=01-ai/Yi)
StableLM 模型 (https://huggingface.co/stabilityai)
Deepseek 模型 (https://huggingface.co/models?search=deepseek-ai/deepseek)
Qwen 模型 (https://huggingface.co/models?search=Qwen/Qwen)
PLaMo-13B (https://github.com/ggml-org/llama.cpp/pull/3557)
Phi 模型 (https://huggingface.co/models?search=microsoft/phi)
PhiMoE (https://github.com/ggml-org/llama.cpp/pull/11003)
GPT-2 (https://huggingface.co/gpt2)
Orion 14B (https://github.com/ggml-org/llama.cpp/pull/5118)
InternLM2 (https://huggingface.co/models?search=internlm2)
CodeShell (https://github.com/WisdomShell/codeshell)
Gemma (https://ai.google.dev/gemma)
Mamba (https://github.com/state-spaces/mamba)
Grok-1 (https://huggingface.co/keyfan/grok-1-hf)
Xverse (https://huggingface.co/models?search=xverse)
Command-R 模型 (https://huggingface.co/models?search=CohereForAI/c4ai-command-r)
SEA-LION (https://huggingface.co/models?search=sea-lion)
GritLM-7B (https://huggingface.co/GritLM/GritLM-7B) + GritLM-8x7B (https://huggingface.co/GritLM/GritLM-8x7B)
OLMo (https://allenai.org/olmo)
OLMo 2 (https://allenai.org/olmo)
OLMoE (https://huggingface.co/allenai/OLMoE-1B-7B-0924)
Granite 模型 (https://huggingface.co/collections/ibm-granite/granite-code-models-6624c5cec322e4c148c8b330)
GPT-NeoX (https://github.com/EleutherAI/gpt-neox) + Pythia (https://github.com/EleutherAI/pythia)
Snowflake-Arctic MoE (https://huggingface.co/collections/Snowflake/arctic-66290090abe542894a5ac520)
Smaug (https://huggingface.co/models?search=Smaug)
Poro 34B (https://huggingface.co/LumiOpen/Poro-34B)
Bitnet b1.58 模型 (https://huggingface.co/1bitLLM)
Flan T5 (https://huggingface.co/models?search=flan-t5)
Open Elm 模型 (https://huggingface.co/collections/apple/openelm-instruct-models-6619ad295d7ae9f868b759ca)
ChatGLM3-6b (https://huggingface.co/THUDM/chatglm3-6b) + ChatGLM4-9b (https://huggingface.co/THUDM/glm-4-9b) + GLMEdge-1.5b (https://huggingface.co/THUDM/glm-edge-1.5b-chat) + GLMEdge-4b (https://huggingface.co/THUDM/glm-edge-4b-chat)
GLM-4-0414 (https://huggingface.co/collections/THUDM/glm-4-0414-67f3cbcb34dd9d252707cb2e)
SmolLM (https://huggingface.co/collections/HuggingFaceTB/smollm-6695016cad7167254ce15966)
EXAONE-3.0-7.8B-Instruct (https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct)
FalconMamba 模型 (https://huggingface.co/collections/tiiuae/falconmamba-7b-66b9a580324dd1598b0f6d4a)
Jais (https://huggingface.co/inceptionai/jais-13b-chat)
Bielik-11B-v2.3 (https://huggingface.co/collections/speakleash/bielik-11b-v23-66ee813238d9b526a072408a)
RWKV-7 (https://huggingface.co/collections/shoumenchougou/rwkv7-gxx-gguf)
RWKV-6 (https://github.com/BlinkDL/RWKV-LM)
QRWKV-6 (https://huggingface.co/recursal/QRWKV6-32B-Instruct-Preview-v0.1)
GigaChat-20B-A3B (https://huggingface.co/ai-sage/GigaChat-20B-A3B-instruct)
Trillion-7B-preview (https://huggingface.co/trillionlabs/Trillion-7B-preview)
Ling 模型 (https://huggingface.co/collections/inclusionAI/ling-67c51c85b34a7ea0aba94c32)
LFM2 模型 (https://huggingface.co/collections/LiquidAI/lfm2-686d721927015b2ad73eaa38)
Hunyuan 模型 (https://huggingface.co/collections/tencent/hunyuan-dense-model-6890632cda26b19119c9c5e7)
BailingMoeV2 (Ring/Ling 2.0) 模型 (https://huggingface.co/collections/inclusionAI/ling-v2-68bf1dd2fc34c306c1fa6f86)

多模态模型

LLaVA 1.5 模型 (https://huggingface.co/collections/liuhaotian/llava-15-653aac15d994e992e2677a7e), LLaVA 1.6 模型 (https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2)
BakLLaVA (https://huggingface.co/models?search=SkunkworksAI/Bakllava)
Obsidian (https://huggingface.co/NousResearch/Obsidian-3B-V0.5)
ShareGPT4V (https://huggingface.co/models?search=Lin-Chen/ShareGPT4V)
MobileVLM 1.7B/3B 模型 (https://huggingface.co/models?search=mobileVLM)
Yi-VL (https://huggingface.co/models?search=Yi-VL)
Mini CPM (https://huggingface.co/models?search=MiniCPM)
Moondream (https://huggingface.co/vikhyatk/moondream2)
Bunny (https://github.com/BAAI-DCAI/Bunny)
GLM-EDGE (https://huggingface.co/models?search=glm-edge)
Qwen2-VL (https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d)
LFM2-VL (https://huggingface.co/collections/LiquidAI/lfm2-vl-68963bbc84a610f7638d5ffa)

绑定

Python: ddh0/easy-llama (https://github.com/ddh0/easy-llama)
Python: abetlen/llama-cpp-python (https://github.com/abetlen/llama-cpp-python)
Go: go-skynet/go-llama.cpp (https://github.com/go-skynet/go-llama.cpp)
Node.js: withcatai/node-llama-cpp (https://github.com/withcatai/node-llama-cpp)
JS/TS（llama.cpp 服务器客户端）：lgrammel/modelfusion (https://modelfusion.dev/integration/model-provider/llamacpp)
JS/TS（可编程提示引擎 CLI）：offline-ai/cli (https://github.com/offline-ai/cli)
JavaScript/Wasm（可在浏览器中运行）：tangledgroup/llama-cpp-wasm (https://github.com/tangledgroup/llama-cpp-wasm)
Typescript/Wasm（更友好的 API，可在 npm 上获取）：ngxson/wllama (https://github.com/ngxson/wllama)
Ruby: yoshoku/llama_cpp.rb (https://github.com/yoshoku/llama_cpp.rb)
Ruby: docusealco/rllama (https://github.com/docusealco/rllama)
Rust（功能更丰富）：edgenai/llama_cpp-rs (https://github.com/edgenai/llama_cpp-rs)
Rust（更友好的 API）：mdrokz/rust-llama.cpp (https://github.com/mdrokz/rust-llama.cpp)
Rust（更直接的绑定）：utilityai/llama-cpp-rs (https://github.com/utilityai/llama-cpp-rs)
Rust（从 crates.io 自动构建）：ShelbyJenkins/llm_client (https://github.com/ShelbyJenkins/llm_client)
C#/.NET: SciSharp/LLamaSharp (https://github.com/SciSharp/LLamaSharp)
C#/VB.NET（功能更丰富——社区许可）：LM-Kit.NET (https://docs.lm-kit.com/lm-kit-net/index.html)
Scala 3: donderom/llm4s (https://github.com/donderom/llm4s)
Clojure: phronmophobic/llama.clj (https://github.com/phronmophobic/llama.clj)
React Native: mybigday/llama.rn (https://github.com/mybigday/llama.rn)
Java: kherud/java-llama.cpp (https://github.com/kherud/java-llama.cpp)
Java: QuasarByte/llama-cpp-jna (https://github.com/QuasarByte/llama-cpp-jna)
Zig: deins/llama.cpp.zig (https://github.com/Deins/llama.cpp.zig)
Flutter/Dart: netdur/llama_cpp_dart (https://github.com/netdur/llama_cpp_dart)
Flutter: xuegao-tzx/Fllama (https://github.com/xuegao-tzx/Fllama)
PHP（基于 llama.cpp 构建的 API 绑定和功能）：distantmagic/resonance (https://github.com/distantmagic/resonance)（更多信息）(https://github.com/ggml-org/llama.cpp/pull/6326)
Guile Scheme: guile_llama_cpp (https://savannah.nongnu.org/projects/guile-llama-cpp)
Swift srgtuszy/llama-cpp-swift (https://github.com/srgtuszy/llama-cpp-swift)
Swift ShenghaiWang/SwiftLlama (https://github.com/ShenghaiWang/SwiftLlama)
Delphi Embarcadero/llama-cpp-delphi (https://github.com/Embarcadero/llama-cpp-delphi)
Go（无需 CGo）：hybridgroup/yzma (https://github.com/hybridgroup/yzma)
Android: llama.android

用户界面

（要在此处列出项目，应明确声明依赖于 llama.cpp）

AI Sublime Text 插件 (https://github.com/yaroslavyaroslav/OpenAI-sublime-text) (MIT)
BonzAI 应用 (https://apps.apple.com/us/app/bonzai-your-local-ai-agent/id6752847988) (专有)
cztomsik/ava (https://github.com/cztomsik/ava) (MIT)
Dot (https://github.com/alexpinel/Dot) (GPL)
eva (https://github.com/ylsdamxssjxxdd/eva) (MIT)
iohub/collama (https://github.com/iohub/coLLaMA) (Apache-2.0)
janhq/jan (https://github.com/janhq/jan) (AGPL)
johnbean393/Sidekick (https://github.com/johnbean393/Sidekick) (MIT)
KanTV (https://github.com/zhouwg/kantv?tab=readme-ov-file) (Apache-2.0)
KodiBot (https://github.com/firatkiral/kodibot) (GPL)
llama.vim (https://github.com/ggml-org/llama.vim) (MIT)
LARS (https://github.com/abgulati/LARS) (AGPL)
Llama Assistant (https://github.com/vietanhdev/llama-assistant) (GPL)
LlamaLib (https://github.com/undreamai/LlamaLib) (Apache-2.0)
LLMFarm (https://github.com/guinmoon/LLMFarm?tab=readme-ov-file) (MIT)
LLMUnity (https://github.com/undreamai/LLMUnity) (MIT)
LMStudio (https://lmstudio.ai/) (专有)
LocalAI (https://github.com/mudler/LocalAI) (MIT)
LostRuins/koboldcpp (https://github.com/LostRuins/koboldcpp) (AGPL)
MindMac (https://mindmac.app) (专有)
MindWorkAI/AI-Studio (https://github.com/MindWorkAI/AI-Studio) (FSL-1.1-MIT)
Mobile-Artificial-Intelligence/maid (https://github.com/Mobile-Artificial-Intelligence/maid) (MIT)
Mozilla-Ocho/llamafile (https://github.com/Mozilla-Ocho/llamafile) (Apache-2.0)
nat/openplayground (https://github.com/nat/openplayground) (MIT)
nomic-ai/gpt4all (https://github.com/nomic-ai/gpt4all) (MIT)
ollama/ollama (https://github.com/ollama/ollama) (MIT)
oobabooga/text-generation-webui (https://github.com/oobabooga/text-generation-webui) (AGPL)
PocketPal AI (https://github.com/a-ghorbani/pocketpal-ai) (MIT)
psugihara/FreeChat (https://github.com/psugihara/FreeChat) (MIT)
ptsochantaris/emeltal (https://github.com/ptsochantaris/emeltal) (MIT)
pythops/tenere (https://github.com/pythops/tenere) (AGPL)
ramalama (https://github.com/containers/ramalama) (MIT)
semperai/amica (https://github.com/semperai/amica) (MIT)
withcatai/catai (https://github.com/withcatai/catai) (MIT)
Autopen (https://github.com/blackhole89/autopen) (GPL)

工具

akx/ggify (https://github.com/akx/ggify) – 从 Hugging Face Hub 下载 PyTorch 模型并将其转换为 GGML
akx/ollama-dl (https://github.com/akx/ollama-dl) – 从 Ollama 库下载模型，直接用于 llama.cpp
crashr/gppm (https://github.com/crashr/gppm) – 启动利用 NVIDIA Tesla P40 或 P100 GPU 的 llama.cpp 实例，并降低空闲功耗
gpustack/gguf-parser (https://github.com/gpustack/gguf-parser-go/tree/main/cmd/gguf-parser) – 检查/查看 GGUF 文件并估算内存使用量
Styled Lines (https://marketplace.unity.com/packages/tools/generative-ai/styled-lines-llama-cpp-model-292902)（专有许可，用于 Unity3d 游戏开发的推理部分异步包装器，附带预构建的移动和 Web 平台包装器及模型示例）
unslothai/unsloth (https://github.com/unslothai/unsloth) – 🦥 将微调和训练的模型导出/保存为 GGUF (Apache-2.0)

基础设施

Paddler (https://github.com/intentee/paddler) – 开源 LLMOps 平台，用于在您的环境中托管和扩展 AI

相似文章

llama.cpp 就是 LLM 界的 Linux

Reddit r/LocalLLaMA

文章把 llama.cpp 比作 Linux，认为这款开源库已成为运行大语言模型的基础底座。

llama : 网站 + 统一的 `llama` 二进制文件 · ggml-org/llama.cpp · 讨论 #23875

Reddit r/LocalLLaMA

Llama.cpp 宣布推出新网站和统一的 'llama' 二进制文件，以简化 LLM 推理，同时还包括 Hugging Face 缓存迁移和多模态支持等更新。

llama: 限制 `llama_context` 的最大输出数 by am17an · Pull Request #23861 · ggml-org/llama.cpp

Reddit r/LocalLLaMA

llama.cpp 是一个开源的 LLM 推理引擎，使用 C/C++ 编写，依赖极少，支持多种硬件和量化方法。此拉取请求限制了 llama_context 的最大输出数。

vllm-project/vllm v0.19.1

GitHub Releases Watchlist

vLLM v0.19.1 发布 - 一个快速易用的开源 LLM 推理和服务库，拥有业界领先的吞吐量，支持 200+ 个模型架构以及包括 NVIDIA/AMD GPU 和 CPU 在内的多样化硬件。

@0xSero：关于 LLM 推理与部署，看这一篇就够了。你听说过：- vLLM - SGLang - llama.cpp - …

X AI KOLs Timeline

vLLM、SGLang、llama.cpp 与 ExLlamaV3 等主流开源推理引擎概览，助你轻松托管并运行大模型。

ggml-org/llama.cpp

ggml-org/llama.cpp

llama.cpp

近期 API 变更

热门话题

快速入门

描述

模型

纯文本模型

多模态模型

绑定

用户界面

工具

基础设施

相似文章

llama.cpp 就是 LLM 界的 Linux

llama : 网站 + 统一的 `llama` 二进制文件 · ggml-org/llama.cpp · 讨论 #23875

llama: 限制 `llama_context` 的最大输出数 by am17an · Pull Request #23861 · ggml-org/llama.cpp

vllm-project/vllm v0.19.1

@0xSero：关于 LLM 推理与部署，看这一篇就够了。你听说过：- vLLM - SGLang - llama.cpp - …

提交意见反馈