ggml-webgpu: 提升k-quants的预填充速度并重构Q4/Q5/Q8及k-quants的矩阵乘法 by yomaytk · Pull Request #24225 · ggml-org/llama.cpp

Reddit r/LocalLLaMA 工具

摘要

提升了k-quants的预填充速度,并重构了llama.cpp WebGPU后端中Q4/Q5/Q8及k-quants的矩阵乘法。

此PR提升了k-quants的矩阵乘法性能。下表展示了在M2 pro上`pp512`测试的提升情况。 |量化类型|模型|[master](https://github.com/ggml-org/llama.cpp/tree/ad1b88ca0d37a2171efba1c04f1a3531c78f1b52) (t/s)|PR (t/s)|加速比| |:-|:-|:-|:-|:-| |Q2\_K|qwen3 0.6B Q2\_K - Medium|817.86 ± 6.14|1991.81 ± 6.87|2.44x| |Q3\_K|qwen35 4B Q3\_K - Medium|92.54 ± 0.13|302.24 ± 0.37|3.27x| ||gemma4 E4B Q3\_K - Medium|79.06 ± 0.08|298.73 ± 0.90|3.78x| |Q4\_K|qwen35 4B Q4\_K - Medium|243.82 ± 0.09|327.24 ± 0.59|1.34x| ||gemma4 E4B Q4\_K - Medium|238.44 ± 0.60|324.97 ± 5.74|1.36x| |Q5\_K|qwen35 4B Q5\_K - Medium|231.23 ± 0.83|307.95 ± 2.93|1.33x| ||gemma4 E4B Q5\_K - Medium|229.46 ± 0.87|306.12 ± 3.28|1.33x| |Q6\_K|qwen35 4B Q6\_K|216.19 ± 0.06|311.52 ± 0.05|1.44x| ||gemma4 E4B Q6\_K|198.79 ± 3.77|303.07 ± 3.28|1.52x|
查看原文
查看缓存全文

缓存时间: 2026/06/09 08:41

ggml-org/llama.cpp

来源:https://github.com/ggml-org/llama.cpp

llama.cpp

许可证:MIT (https://opensource.org/licenses/MIT)
发布页面 (https://github.com/ggml-org/llama.cpp/releases)
服务器 (https://github.com/ggml-org/llama.cpp/actions/workflows/server.yml)
Docker (https://github.com/ggml-org/llama.cpp/actions/workflows/docker.yml)
Winget (https://github.com/ggml-org/llama.cpp/actions/workflows/winget.yml)
宣言 (https://github.com/ggml-org/llama.cpp/discussions/205) / ggml (https://github.com/ggml-org/ggml) / ops (https://github.com/ggml-org/llama.cpp/blob/master/docs/ops.md)

用C/C++实现LLM推理

近期API变更

  • libllama API变更日志 (https://github.com/ggml-org/llama.cpp/issues/9289)
  • llama-server REST API变更日志 (https://github.com/ggml-org/llama.cpp/issues/9291)

热门话题

  • Hugging Face 缓存迁移:使用 -hf 下载的模型现在存储在标准 Hugging Face 缓存目录中,可与其他 HF 工具共享。
  • 指南:使用 llama.cpp 的新 WebUI (https://github.com/ggml-org/llama.cpp/discussions/16938)
  • 指南:使用 llama.cpp 运行 gpt-oss (https://github.com/ggml-org/llama.cpp/discussions/15396)
  • [反馈] 改进 llama.cpp 的打包以支持下游消费者 🤗
  • 已添加对本地 MXFP4 格式 gpt-oss 模型的支持 | 拉取请求 (https://github.com/ggml-org/llama.cpp/pull/15091) | 与 NVIDIA 合作 (https://blogs.nvidia.com/blog/rtx-ai-garage-openai-oss) | 评论 (https://github.com/ggml-org/llama.cpp/discussions/15095)
  • llama-server 已支持多模态:#12898 (https://github.com/ggml-org/llama.cpp/pull/12898) | 文档
  • 用于 FIM 补全的 VS Code 扩展:https://github.com/ggml-org/llama.vscode
  • 用于 FIM 补全的 Vim/Neovim 插件:https://github.com/ggml-org/llama.vim
  • Hugging Face 推理端点现已原生支持 GGUF!https://github.com/ggml-org/llama.cpp/discussions/9669
  • Hugging Face GGUF 编辑器:讨论 (https://github.com/ggml-org/llama.cpp/discussions/9268) | 工具 (https://huggingface.co/spaces/CISCai/gguf-editor)
  • 浏览器中现已支持 WebGPU,查看介绍博客/演示 (https://reeselevine.github.io/llamas-on-the-web/)。

快速开始

使用 llama.cpp 非常简单。以下是几种在您机器上安装的方式:

安装后,您需要一个模型来使用。前往 获取和量化模型 部分了解更多。

示例命令:

# 使用本地模型文件
llama-cli -m my_model.gguf

# 或直接从 Hugging Face 下载并运行模型
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

# 启动兼容 OpenAI 的 API 服务器
llama-server -hf ggml-org/gemma-3-1b-it-GGUF

描述

llama.cpp 的主要目标是在本地和云端各种硬件上,以最小的设置和领先的性能实现 LLM 推理。

  • 纯 C/C++ 实现,无任何依赖
  • 苹果芯片是首要支持对象——通过 ARM NEON、Accelerate 和 Metal 框架优化
  • 支持 x86 架构的 AVX、AVX2、AVX512 和 AMX
  • 支持 RISC-V 架构的 RVV、ZVFH、ZFH、ZICBOP 和 ZIHINTPAUSE
  • 提供 1.5-bit、2-bit、3-bit、4-bit、5-bit、6-bit 和 8-bit 整数量化,实现更快的推理和更低的内存占用
  • 自定义 CUDA 内核,用于在 NVIDIA GPU 上运行 LLM(通过 HIP 支持 AMD GPU,通过 MUSA 支持 Moore Threads GPU)
  • 支持 Vulkan 和 SYCL 后端
  • CPU+GPU 混合推理,可部分加速超过显存容量的大型模型

llama.cpp 项目是开发 ggml (https://github.com/ggml-org/ggml) 库新功能的主要试验场。

模型

通常,以下基础模型的微调版本也受支持。添加新模型支持的说明:HOWTO-add-model.md

纯文本模型

  • LLaMA 🦙
  • LLaMA 2 🦙🦙
  • LLaMA 3 🦙🦙🦙
  • Mistral 7B (https://huggingface.co/mistralai/Mistral-7B-v0.1)
  • Mixtral MoE (https://huggingface.co/models?search=mistral-ai/Mixtral)
  • DBRX (https://huggingface.co/databricks/dbrx-instruct)
  • Jamba (https://huggingface.co/ai21labs)
  • Falcon (https://huggingface.co/models?search=tiiuae/falcon)
  • 中文 LLaMA / Alpaca (https://github.com/ymcui/Chinese-LLaMA-Alpaca) 和中文 LLaMA-2 / Alpaca-2 (https://github.com/ymcui/Chinese-LLaMA-Alpaca-2)
  • Vigogne(法语)(https://github.com/bofenghuang/vigogne)
  • BERT (https://github.com/ggml-org/llama.cpp/pull/5423)
  • Koala (https://bair.berkeley.edu/blog/2023/04/03/koala/)
  • Baichuan 1 & 2 (https://huggingface.co/models?search=baichuan-inc/Baichuan) 及其衍生模型 (https://huggingface.co/hiyouga/baichuan-7b-sft)
  • Aquila 1 & 2 (https://huggingface.co/models?search=BAAI/Aquila)
  • Starcoder 模型 (https://github.com/ggml-org/llama.cpp/pull/3187)
  • Refact (https://huggingface.co/smallcloudai/Refact-1_6B-fim)
  • MPT (https://github.com/ggml-org/llama.cpp/pull/3417)
  • Bloom (https://github.com/ggml-org/llama.cpp/pull/3553)
  • Yi 模型 (https://huggingface.co/models?search=01-ai/Yi)
  • StableLM 模型 (https://huggingface.co/stabilityai)
  • Deepseek 模型 (https://huggingface.co/models?search=deepseek-ai/deepseek)
  • Qwen 模型 (https://huggingface.co/models?search=Qwen/Qwen)
  • PLaMo-13B (https://github.com/ggml-org/llama.cpp/pull/3557)
  • Phi 模型 (https://huggingface.co/models?search=microsoft/phi)
  • PhiMoE (https://github.com/ggml-org/llama.cpp/pull/11003)
  • GPT-2 (https://huggingface.co/gpt2)
  • Orion 14B (https://github.com/ggml-org/llama.cpp/pull/5118)
  • InternLM2 (https://huggingface.co/models?search=internlm2)
  • CodeShell (https://github.com/WisdomShell/codeshell)
  • Gemma (https://ai.google.dev/gemma)
  • Mamba (https://github.com/state-spaces/mamba)
  • Grok-1 (https://huggingface.co/keyfan/grok-1-hf)
  • Xverse (https://huggingface.co/models?search=xverse)
  • Command-R 模型 (https://huggingface.co/models?search=CohereForAI/c4ai-command-r)
  • SEA-LION (https://huggingface.co/models?search=sea-lion)
  • GritLM-7B (https://huggingface.co/GritLM/GritLM-7B) + GritLM-8x7B (https://huggingface.co/GritLM/GritLM-8x7B)
  • OLMo (https://allenai.org/olmo)
  • OLMo 2 (https://allenai.org/olmo)
  • OLMoE (https://huggingface.co/allenai/OLMoE-1B-7B-0924)
  • Granite 模型 (https://huggingface.co/collections/ibm-granite/granite-code-models-6624c5cec322e4c148c8b330)
  • GPT-NeoX (https://github.com/EleutherAI/gpt-neox) + Pythia (https://github.com/EleutherAI/pythia)
  • Snowflake-Arctic MoE (https://huggingface.co/collections/Snowflake/arctic-66290090abe542894a5ac520)
  • Smaug (https://huggingface.co/models?search=Smaug)
  • Poro 34B (https://huggingface.co/LumiOpen/Poro-34B)
  • Bitnet b1.58 模型 (https://huggingface.co/1bitLLM)
  • Flan T5 (https://huggingface.co/models?search=flan-t5)
  • Open Elm 模型 (https://huggingface.co/collections/apple/openelm-instruct-models-6619ad295d7ae9f868b759ca)
  • ChatGLM3-6b (https://huggingface.co/THUDM/chatglm3-6b) + ChatGLM4-9b (https://huggingface.co/THUDM/glm-4-9b) + GLMEdge-1.5b (https://huggingface.co/THUDM/glm-edge-1.5b-chat) + GLMEdge-4b (https://huggingface.co/THUDM/glm-edge-4b-chat)
  • GLM-4-0414 (https://huggingface.co/collections/THUDM/glm-4-0414-67f3cbcb34dd9d252707cb2e)
  • SmolLM (https://huggingface.co/collections/HuggingFaceTB/smollm-6695016cad7167254ce15966)
  • EXAONE-3.0-7.8B-Instruct (https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct)
  • FalconMamba 模型 (https://huggingface.co/collections/tiiuae/falconmamba-7b-66b9a580324dd1598b0f6d4a)
  • Jais (https://huggingface.co/inceptionai/jais-13b-chat)
  • Bielik-11B-v2.3 (https://huggingface.co/collections/speakleash/bielik-11b-v23-66ee813238d9b526a072408a)
  • RWKV-7 (https://huggingface.co/collections/shoumenchougou/rwkv7-gxx-gguf)
  • RWKV-6 (https://github.com/BlinkDL/RWKV-LM)
  • QRWKV-6 (https://huggingface.co/recursal/QRWKV6-32B-Instruct-Preview-v0.1)
  • GigaChat-20B-A3B (https://huggingface.co/ai-sage/GigaChat-20B-A3B-instruct)
  • Trillion-7B-preview (https://huggingface.co/trillionlabs/Trillion-7B-preview)
  • Ling 模型 (https://huggingface.co/collections/inclusionAI/ling-67c51c85b34a7ea0aba94c32)
  • LFM2 模型 (https://huggingface.co/collections/LiquidAI/lfm2-686d721927015b2ad73eaa38)
  • Hunyuan 模型 (https://huggingface.co/collections/tencent/hunyuan-dense-model-6890632cda26b19119c9c5e7)
  • BailingMoeV2 (Ring/Ling 2.0) 模型 (https://huggingface.co/collections/inclusionAI/ling-v2-68bf1dd2fc34c306c1fa6f86)
  • Mellum 模型 (https://huggingface.co/JetBrains/models?search=mellum)

多模态模型

  • LLaVA 1.5 模型 (https://huggingface.co/collections/liuhaotian/llava-15-653aac15d994e992e2677a7e),LLaVA 1.6 模型 (https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2)
  • BakLLaVA (https://huggingface.co/models?search=SkunkworksAI/Bakllava)
  • Obsidian (https://huggingface.co/NousResearch/Obsidian-3B-V0.5)
  • ShareGPT4V (https://huggingface.co/models?search=Lin-Chen/ShareGPT4V)
  • MobileVLM 1.7B/3B 模型 (https://huggingface.co/models?search=mobileVLM)
  • Yi-VL (https://huggingface.co/models?search=Yi-VL)
  • Mini CPM (https://huggingface.co/models?search=MiniCPM)
  • Moondream (https://huggingface.co/vikhyatk/moondream2)
  • Bunny (https://github.com/BAAI-DCAI/Bunny)
  • GLM-EDGE (https://huggingface.co/models?search=glm-edge)
  • Qwen2-VL (https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d)
  • LFM2-VL (https://huggingface.co/collections/LiquidAI/lfm2-vl-68963bbc84a610f7638d5ffa)

绑定

  • Python: ddh0/easy-llama (https://github.com/ddh0/easy-llama)
  • Python: abetlen/llama-cpp-python (https://github.com/abetlen/llama-cpp-python)
  • Go: go-skynet/go-llama.cpp (https://github.com/go-skynet/go-llama.cpp)
  • Node.js: withcatai/node-llama-cpp (https://github.com/withcatai/node-llama-cpp)
  • JS/TS (llama.cpp 服务器客户端): lgrammel/modelfusion (https://modelfusion.dev/integration/model-provider/llamacpp)
  • JS/TS (可编程提示引擎 CLI): offline-ai/cli (https://github.com/offline-ai/cli)
  • JavaScript/Wasm (可在浏览器中运行): tangledgroup/llama-cpp-wasm (https://github.com/tangledgroup/llama-cpp-wasm)
  • TypeScript/Wasm (更友好的 API,可从 npm 获取): ngxson/wllama (https://github.com/ngxson/wllama)
  • Ruby: yoshoku/llama_cpp.rb (https://github.com/yoshoku/llama_cpp.rb)
  • Ruby: docusealco/rllama (https://github.com/docusealco/rllama)
  • Rust (更多功能): edgenai/llama_cpp-rs (https://github.com/edgenai/llama_cpp-rs)
  • Rust (更友好的 API): mdrokz/rust-llama.cpp (https://github.com/mdrokz/rust-llama.cpp)
  • Rust (更直接的绑定): utilityai/llama-cpp-rs (https://github.com/utilityai/llama-cpp-rs)
  • Rust (自动从 crates.io 构建): ShelbyJenkins/llm_client (https://github.com/ShelbyJenkins/llm_client)
  • C#/.NET: SciSharp/LLamaSharp (https://github.com/SciSharp/LLamaSharp)
  • C#/VB.NET (更多功能 - 社区许可证): LM-Kit.NET (https://docs.lm-kit.com/lm-kit-net/index.html)
  • Scala 3: donderom/llm4s (https://github.com/donderom/llm4s)
  • Clojure: phronmophobic/llama.clj (https://github.com/phronmophobic/llama.clj)
  • React Native: mybigday/llama.rn (https://github.com/mybigday/llama.rn)
  • Java: kherud/java-llama.cpp (https://github.com/kherud/java-llama.cpp)
  • Java: QuasarByte/llama-cpp-jna (https://github.com/QuasarByte/llama-cpp-jna)
  • Zig: deins/llama.cpp.zig (https://github.com/Deins/llama.cpp.zig)
  • Flutter/Dart: netdur/llama_cpp_dart (https://github.com/netdur/llama_cpp_dart)
  • Flutter: xuegao-tzx/Fllama (https://github.com/xuegao-tzx/Fllama)
  • PHP(基于 llama.cpp 的 API 绑定和功能): distantmagic/resonance (https://github.com/distantmagic/resonance)(更多信息)(https://github.com/ggml-org/llama.cpp/pull/6326)
  • Guile Scheme: guile_llama_cpp (https://savannah.nongnu.org/projects/guile-llama-cpp)
  • Swift: srgtuszy/llama-cpp-swift (https://github.com/srgtuszy/llama-cpp-swift)
  • Swift: ShenghaiWang/SwiftLlama (https://github.com/ShenghaiWang/SwiftLlama)
  • Delphi: Embarcadero/llama-cpp-delphi (https://github.com/Embarcadero/llama-cpp-delphi)
  • Go(无需 CGo): hybridgroup/yzma (https://github.com/hybridgroup/yzma)
  • Android: llama.android

用户界面(UI)

(项目如需在此列出,应明确声明依赖 llama.cpp

  • AI Sublime Text 插件 (https://github.com/yaroslavyaroslav/OpenAI-sublime-text) (MIT)
  • BonzAI App (https://apps.apple.com/us/app/bonzai-your-local-ai-agent/id6752847988) (专有)
  • cztomsik/ava (https://github.com/cztomsik/ava) (MIT)
  • Dot (https://github.com/alexpinel/Dot) (GPL)
  • eva (https://github.com/ylsdamxssjxxdd/eva) (MIT)
  • iohub/collama (https://github.com/iohub/coLLaMA) (Apache-2.0)
  • janhq/jan (https://github.com/janhq/jan) (AGPL)
  • johnbean393/Sidekick (https://github.com/johnbean393/Sidekick) (MIT)
  • KanTV (https://github.com/zhouwg/kantv?tab=readme-ov-file) (Apache-2.0)
  • KodiBot (https://github.com/firatkiral/kodibot) (GPL)
  • llama.vim (https://github.com/ggml-org/llama.vim) (MIT)
  • LARS (https://github.com/abgulati/LARS) (AGPL)
  • Llama Assistant (https://github.com/vietanhdev/llama-assistant) (GPL)
  • LlamaLib (https://github.com/undreamai/LlamaLib) (Apache-2.0)
  • LLMFarm (https://github.com/guinmoon/LLMFarm?tab=readme-ov-file) (MIT)
  • LLMUnity (https://github.com/undreamai/LLMUnity) (MIT)
  • LMStudio (https://lmstudio.ai/) (专有)
  • LocalAI (https://github.com/mudler/LocalAI) (MIT)
  • LostRuins/koboldcpp (https://github.com/LostRuins/koboldcpp) (AGPL)
  • MindMac (https://mindmac.app) (专有)
  • MindWorkAI/AI-Studio (https://github.com/MindWorkAI/AI-Studio) (FSL-1.1-MIT)
  • Mobile-Artificial-Intelligence/maid (https://github.com/Mobile-Artificial-Intelligence/maid) (MIT)
  • Mozilla-Ocho/llamafile (https://github.com/Mozilla-Ocho/llamafile) (Apache-2.0)
  • nat/openplayground (https://github.com/nat/openplayground) (MIT)
  • nomic-ai/gpt4all (https://github.com/nomic-ai/gpt4all) (MIT)
  • ollama/ollama (https://github.com/ollama/ollama) (MIT)
  • oobabooga/text-generation-webui (https://github.com/oobabooga/text-generation-webui) (AGPL)
  • PocketPal AI (https://github.com/a-ghorbani/pocketpal-ai) (MIT)
  • psugihara/FreeChat (https://github.com/psugihara/FreeChat) (MIT)
  • ptsochantaris/emeltal (https://github.com/ptsochantaris/emeltal) (MIT)
  • pythops/tenere (https://github.com/pythops/tenere) (AGPL)
  • ramalama (https://github.com/containers/ramalama) (MIT)
  • semperai/amica (https://github.com/semperai/amica) (MIT)
  • withcatai/catai (https://github.com/withcatai/catai) (MIT)
  • Autopen (https://github.com/blackhole89/autopen) (GPL)

工具

  • akx/ggify (https://github.com/akx/ggify) – 从 Hugging Face Hub 下载 PyTorch 模型并转换为 GGML
  • akx/ollama-dl (https://github.com/akx/ollama-dl) – 从 Ollama 库下载模型,直接用于 llama.cpp
  • crashr/gppm (https://github.com/crashr/gppm) – 启动利用 NVIDIA Tesla P40 或 P100 GPU 的 llama.cpp 实例,降低空闲功耗
  • gpustack/gguf-parser (https://github.com/gpustack/gguf-parser-go/tree/main/cmd/gguf-parser) – 审查/检查 GGUF 文件并估算内存使用
  • Styled Lines (https://marketplace.unity.com/packages/tools/generative-ai/styled-lines-llama-cpp-model-292902)(专有许可,推理的异步包装器)

相似文章

双GPU llama.cpp加速

Reddit r/LocalLLaMA

llama.cpp的一个分支修复了量化KV缓存中的--split-mode tensor问题,在双GPU配置上实现高达40%的速度提升,且无质量损失。