ggml-cpu: Optimized x86 and generic CPU q1_0 dot (follow-up) by pl752 · Pull Request #21636 · ggml-org/llama.cpp

Reddit r/LocalLLaMA 2026/04/21 11:41 Tools

Summary

This PR adds q1_0 dot-product kernels to ggml-cpu, optimized for x86 and for generic CPUs, to speed up inference with quantized LLMs.

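Available since [b8858](https://github.com/ggml-org/llama.cpp/releases/tag/b8858). This is the optimized CPU path, so t/s is faster now. (I just tested it on my ancient laptop with 16 GB of DDR3 RAM. Before: 0.3 t/s. After: 1.7 t/s. Apparently this machine has neither AVX nor AVX512, so the gain is not the maximum possible. I'll test again on my new laptop this week.) By the way, the Metal, Vulkan, and CUDA backends are already supported too (for the 1-bit models ... Bonsai). If you haven't tried them yet, take a look.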
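The post does not show the kernel itself. As a rough illustration of why a 1-bit dot product can be cheap even on a CPU without AVX, here is a minimal Python sketch. It assumes a hypothetical block layout (32 weights per block, one sign bit per weight, one float scale per block) and int8-quantized activations. The names `BLOCK`, `pack_signs`, `quantize_int8`, `dot_block_naive`, and `dot_block_fast` are invented for this example, and nothing here is claimed to match ggml's actual q1_0 format or the code in the PR.

```python
from typing import List, Tuple

BLOCK = 32  # hypothetical block size, chosen for illustration only


def pack_signs(weights: List[float]) -> Tuple[float, int]:
    """Quantize one block to a float scale plus one sign bit per weight."""
    assert len(weights) == BLOCK
    scale = sum(abs(w) for w in weights) / BLOCK  # mean magnitude as the scale
    bits = 0
    for i, w in enumerate(weights):
        if w >= 0:
            bits |= 1 << i  # bit set means +1, bit clear means -1
    return scale, bits


def quantize_int8(xs: List[float]) -> Tuple[float, List[int]]:
    """Quantize activations to int8-range integers with one shared scale."""
    amax = max(abs(x) for x in xs)
    scale = amax / 127 if amax else 1.0
    return scale, [round(x / scale) for x in xs]


def dot_block_naive(w_scale: float, bits: int, a_scale: float, a_q: List[int]) -> float:
    """Reference version: add or subtract each activation based on its sign bit."""
    acc = 0
    for i, a in enumerate(a_q):
        acc += a if (bits >> i) & 1 else -a
    return w_scale * a_scale * acc


def dot_block_fast(w_scale: float, bits: int, a_scale: float, a_q: List[int]) -> float:
    """Same result via the identity sum(s_i * a_i) = 2 * sum(a_i where s_i = +1) - sum(a_i)."""
    total = sum(a_q)
    positive = sum(a for i, a in enumerate(a_q) if (bits >> i) & 1)
    return w_scale * a_scale * (2 * positive - total)
```

Both functions return the same value for a block. The second one replaces the per-weight sign decision with two integer sums, which is the general kind of rewrite that can help a generic (non-SIMD) code path; whether the PR uses this particular identity is not stated in the post.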


Cached at: 2026/04/21 14:55

ggml-org/llama.cpp

Source: https://github.com/ggml-org/llama.cpp

# llama.cpp

[License: MIT](https://opensource.org/licenses/MIT) | [Release](https://github.com/ggml-org/llama.cpp/releases) | [Server](https://github.com/ggml-org/llama.cpp/actions/workflows/server.yml)

[Manifesto](https://github.com/ggml-org/llama.cpp/discussions/205) / [ggml](https://github.com/ggml-org/ggml) / [ops](https://github.com/ggml-org/llama.cpp/blob/master/docs/ops.md)

LLM inference in C/C++

## Recent API changes

- [Changelog for the `libllama` API](https://github.com/ggml-org/llama.cpp/issues/9289)
- [Changelog for the `llama-server` REST API](https://github.com/ggml-org/llama.cpp/issues/9291)

## Hot topics

- Hugging Face cache migration: models downloaded with `-hf` are now stored in the standard HF cache directory, so they can be shared with other HF tools.
- [Guide: using the new WebUI of llama.cpp](https://github.com/ggml-org/llama.cpp/discussions/16938)
- [Guide: running gpt-oss with llama.cpp](https://github.com/ggml-org/llama.cpp/discussions/15396)
- [FEEDBACK] Better packaging for llama.cpp to support downstream consumers 🤗
- Support for the `gpt-oss` model with native MXFP4 format has been added | [PR](https://github.com/ggml-org/llama.cpp/pull/15091) | [Collaboration with NVIDIA](https://blogs.nvidia.com/blog/rtx-ai-garage-openai-oss) | [Discussion](https://github.com/ggml-org/llama.cpp/discussions/15095)
- Multimodal support has arrived in `llama-server`: [#12898](https://github.com/ggml-org/llama.cpp/pull/12898) | Documentation
- VS Code extension for FIM completions: https://github.com/ggml-org/llama.vscode
- Vim/Neovim plugin for FIM completions: https://github.com/ggml-org/llama.vim
- Hugging Face Inference Endpoints now support GGUF out of the box! https://github.com/ggml-org/llama.cpp/discussions/9669
- Hugging Face GGUF editor: [discussion](https://github.com/ggml-org/llama.cpp/discussions/9268) | [tool](https://huggingface.co/spaces/CISCai/gguf-editor)

----

## Quick start

Getting started with llama.cpp is straightforward. Here are several ways to install it:

- Install `llama.cpp` using brew, nix, or winget
- Run with Docker (see the Docker documentation)
- Download pre-built binaries from the [releases page](https://github.com/ggml-org/llama.cpp/releases)
- Build from source by cloning the repository (see the build guide)

Once installed, you need a model to work with. See the "Obtaining and quantizing models" section below to learn more.

Example commands:

```sh
# Use a local model file
llama-cli -m my_model.gguf

# Or download and run a model directly from Hugging Face
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

# Launch an OpenAI-compatible API server
llama-server -hf ggml-org/gemma-3-1b-it-GGUF
```

## Description

The main goal of `llama.cpp` is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware, locally and in the cloud.

- Plain C/C++ implementation without any dependencies
- Apple silicon is a first-class citizen: optimized via ARM NEON, Accelerate, and Metal
- AVX, AVX2, AVX512, and AMX support for x86 architectures
- RVV, ZVFH, ZFH, ZICBOP, and ZIHINTPAUSE support for RISC-V architectures
- 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
- Custom CUDA kernels for NVIDIA GPUs (AMD GPUs are supported via HIP and Moore Threads GPUs via MUSA)
- Vulkan and SYCL backend support
- CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity

`llama.cpp` is also the main playground for developing new features for the [ggml](https://github.com/ggml-org/ggml) library.

Models

Finetunes of the base models below are typically supported as well. Instructions for adding support for new models: HOWTO-add-model.md

#### Text-only

- [x] LLaMA 🦙
- [x] LLaMA 2 🦙🦙
- [x] LLaMA 3 🦙🦙🦙
- [x] [Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-v0.1)
- [x] [Mixtral MoE](https://huggingface.co/models?search=mistral-ai/Mixtral)
- [x] [DBRX](https://huggingface.co/databricks/dbrx-instruct)
- [x] [Jamba](https://huggingface.co/ai21labs)
- [x] [Falcon](https://huggingface.co/models?search=tiiuae/falcon)
- [x] [Chinese LLaMA / Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca) and [Chinese LLaMA-2 / Alpaca-2](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2)
- [x] [Vigogne (French)](https://github.com/bofenghuang/vigogne)
- [x] [BERT](https://github.com/ggml-org/llama.cpp/pull/5423)
- [x] [Koala](https://bair.berkeley.edu/blog/2023/04/03/koala/)
- [x] [Baichuan 1 & 2](https://huggingface.co/models?search=baichuan-inc/Baichuan) + [derivations](https://huggingface.co/hiyouga/baichuan-7b-sft)
- [x] [Aquila 1 & 2](https://huggingface.co/models?search=BAAI/Aquila)
- [x] [Starcoder models](https://github.com/ggml-org/llama.cpp/pull/3187)
- [x] [Refact](https://huggingface.co/smallcloudai/Refact-1_6B-fim)
- [x] [MPT](https://github.com/ggml-org/llama.cpp/pull/3417)
- [x] [Bloom](https://github.com/ggml-org/llama.cpp/pull/3553)
- [x] [Yi models](https://huggingface.co/models?search=01-ai/Yi)
- [x] [StableLM models](https://huggingface.co/stabilityai)
- [x] [Deepseek models](https://huggingface.co/models?search=deepseek-ai/deepseek)
- [x] [Qwen models](https://huggingface.co/models?search=Qwen/Qwen)
- [x] [PLaMo-13B](https://github.com/ggml-org/llama.cpp/pull/3557)
- [x] [Phi models](https://huggingface.co/models?search=microsoft/phi)
- [x] [PhiMoE](https://github.com/ggml-org/llama.cpp/pull/11003)
- [x] [GPT-2](https://huggingface.co/gpt2)
- [x] [Orion 14B](https://github.com/ggml-org/llama.cpp/pull/5118)
- [x] [InternLM2](https://huggingface.co/models?search=internlm2)
- [x] [CodeShell](https://github.com/WisdomShell/codeshell)
- [x] [Gemma](https://ai.google.dev/gemma)
- [x] [Mamba](https://github.com/state-spaces/mamba)
- [x] [Grok-1](https://huggingface.co/keyfan/grok-1-hf)
- [x] [Xverse](https://huggingface.co/models?search=xverse)
- [x] [Command-R models](https://huggingface.co/models?search=CohereForAI/c4ai-command-r)
- [x] [SEA-LION](https://huggingface.co/models?search=sea-lion)
- [x] [GritLM-7B](https://huggingface.co/GritLM/GritLM-7B) + [GritLM-8x7B](https://huggingface.co/GritLM/GritLM-8x7B)
- [x] [OLMo](https://allenai.org/olmo)
- [x] [OLMo 2](https://allenai.org/olmo)
- [x] [OLMoE](https://huggingface.co/allenai/OLMoE-1B-7B-0924)
- [x] [Granite models](https://huggingface.co/collections/ibm-granite/granite-code-models-6624c5cec322e4c148c8b330)
- [x] [GPT-NeoX](https://github.com/EleutherAI/gpt-neox) + [Pythia](https://github.com/EleutherAI/pythia)
- [x] [Snowflake-Arctic MoE](https://huggingface.co/collections/Snowflake/arctic-66290090abe542894a5ac520)
- [x] [Smaug](https://huggingface.co/models?search=Smaug)
- [x] [Poro 34B](https://huggingface.co/LumiOpen/Poro-34B)
- [x] [Bitnet b1.58 models](https://huggingface.co/1bitLLM)
- [x] [Flan T5](https://huggingface.co/models?search=flan-t5)
- [x] [Open Elm models](https://huggingface.co/collections/apple/openelm-instruct-models-6619ad295d7ae9f868b759ca)
- [x] [ChatGLM3-6b](https://huggingface.co/THUDM/chatglm3-6b) + [ChatGLM4-9b](https://huggingface.co/THUDM/glm-4-9b) + [GLMEdge-1.5b](https://huggingface.co/THUDM/glm-edge-1.5b-chat) + [GLMEdge-4b](https://huggingface.co/THUDM/glm-edge-4b-chat)
- [x] [GLM-4-0414](https://huggingface.co/collections/THUDM/glm-4-0414-67f3cbcb34dd9d252707cb2e)
- [x] [SmolLM](https://huggingface.co/collections/HuggingFaceTB/smollm-6695016cad7167254ce15966)
- [x] [EXAONE-3.0-7.8B-Instruct](https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct)
- [x] [FalconMamba models](https://huggingface.co/collections/tiiuae/falconmamba-7b-66b9a580324dd1598b0f6d4a)
- [x] [Jais](https://huggingface.co/inceptionai/jais-13b-chat)
- [x] [Bielik-11B-v2.3](https://huggingface.co/collections/speakleash/bielik-11b-v23-66ee813238d9b526a072408a)
- [x] [RWKV-7](https://huggingface.co/collections/shoumenchougou/rwkv7-gxx-gguf)
- [x] [RWKV-6](https://github.com/BlinkDL/RWKV-LM)
- [x] [QRWKV-6](https://huggingface.co/recursal/QRWKV6-32B-Instruct-Preview-v0.1)
- [x] [GigaChat-20B-A3B](https://huggingface.co/ai-sage/GigaChat-20B-A3B-instruct)
- [x] [Trillion-7B-preview](https://huggingface.co/trillionlabs/Trillion-7B-preview)
- [x] [Ling models](https://huggingface.co/collections/inclusionAI/ling-67c51c85b34a7ea0aba94c32)
- [x] [LFM2 models](https://huggingface.co/collections/LiquidAI/lfm2-686d721927015b2ad73eaa38)
- [x] [Hunyuan models](https://huggingface.co/collections/tencent/hunyuan-dense-model-6890632cda26b19119c9c5e7)
- [x] [BailingMoeV2 (Ring/Ling 2.0) models](https://huggingface.co/collections/inclusionAI/ling-v2-68bf1dd2fc34c306c1fa6f86)

#### Multimodal

- [x] [LLaVA 1.5 models](https://huggingface.co/collections/liuhaotian/llava-15-653aac15d994e992e2677a7e), [LLaVA 1.6 models](https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2)
- [x] [BakLLaVA](https://huggingface.co/models?search=SkunkworksAI/Bakllava)
- [x] [Obsidian](https://huggingface.co/NousResearch/Obsidian-3B-V0.5)
- [x] [ShareGPT4V](https://huggingface.co/models?search=Lin-Chen/ShareGPT4V)
- [x] [MobileVLM 1.7B/3B models](https://huggingface.co/models?search=mobileVLM)
- [x] [Yi-VL](https://huggingface.co/models?search=Yi-VL)
- [x] [Mini CPM](https://huggingface.co/models?search=MiniCPM)
- [x] [Moondream](https://huggingface.co/vikhyatk/moondream2)
- [x] [Bunny](https://github.com/BAAI-DCAI/Bunny)
- [x] [GLM-EDGE](https://huggingface.co/models?search=glm-edge)
- [x] [Qwen2-VL](https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d)
- [x] [LFM2-VL](https://huggingface.co/collections/LiquidAI/lfm2-vl-68963bbc84a610f7638d5ffa)

Bindings

- Python: [ddh0/easy-llama](https://github.com/ddh0/easy-llama)
- Python: [abetlen/llama-cpp-python](https://github.com/abetlen/llama-cpp-python)
- Go: [go-skynet/go-llama.cpp](https://github.com/go-skynet/go-llama.cpp)
- Node.js: [withcatai/node-llama-cpp](https://github.com/withcatai/node-llama-cpp)
- JS/TS (llama.cpp server client): [lgrammel/modelfusion](https://modelfusion.dev/integration/model-provider/llamacpp)
- JS/TS (programmable prompt engine CLI): [offline-ai/cli](https://github.com/offline-ai/cli)
- JavaScript/Wasm (works in the browser): [tangledgroup/llama-cpp-wasm](https://github.com/tangledgroup/llama-cpp-wasm)
- TypeScript/Wasm (nicer API, available on npm): [ngxson/wllama](https://github.com/ngxson/wllama)
- Ruby: [yoshoku/llama_cpp.rb](https://github.com/yoshoku/llama_cpp.rb)
- Rust (more features): [edgenai/llama_cpp-rs](https://github.com/edgenai/llama_cpp-rs)
- Rust (nicer API): [mdrokz/rust-llama.cpp](https://github.com/mdrokz/rust-llama.cpp)
- Rust (more direct bindings): [utilityai/llama-cpp-rs](https://github.com/utilityai/llama-cpp-rs)
- Rust (automated build from crates.io): [ShelbyJenkins/llm_client](https://github.com/ShelbyJenkins/llm_client)
- C#/.NET: [SciSharp/LLamaSharp](https://github.com/SciSharp/LLamaSharp)
- C#/VB.NET (more features, community license): [LM-Kit.NET](https://docs.lm-kit.com/lm-kit-net/index.html)
- Scala 3: [donderom/llm4s](https://github.com/donderom/llm4s)
- Clojure: [phronmophobic/llama.clj](https://github.com/phronmophobic/llama.clj)
- React Native: [mybigday/llama.rn](https://github.com/mybigday/llama.rn)
- Java: [kherud/java-llama.cpp](https://github.com/kherud/java-llama.cpp)
- Java: [QuasarByte/llama-cpp-jna](https://github.com/QuasarByte/llama-cpp-jna)
- Zig: [deins/llama.cpp.zig](https://github.com/Deins/llama.cpp.zig)
- Flutter/Dart: [netdur/llama_cpp_dart](https://github.com/netdur/llama_cpp_dart)
- Flutter: [xuegao-tzx/Fllama](https://github.com/xuegao-tzx/Fllama)
- PHP (API bindings and features built on top of llama.cpp): [distantmagic/resonance](https://github.com/distantmagic/resonance) ([more info](https://github.com/ggml-org/llama.cpp/pull/6326))
- Guile Scheme: [guile_llama_cpp](https://savannah.nongnu.org/projects/guile-llama-cpp)
- Swift: [srgtuszy/llama-cpp-swift](https://github.com/srgtuszy/llama-cpp-swift)
- Swift: [ShenghaiWang/SwiftLlama](https://github.com/ShenghaiWang/SwiftLlama)
- Delphi: [Embarcadero/llama-cpp-delphi](https://github.com/Embarcadero/llama-cpp-delphi)
- Go (no CGo needed): [hybridgroup/yzma](https://github.com/hybridgroup/yzma)
- Android: llama.android

UIs

(To be listed here, a project must clearly state that it depends on `llama.cpp`.)

- [AI Sublime Text plugin](https://github.com/yaroslavyaroslav/OpenAI-sublime-text) (MIT)
- [BonzAI App](https://apps.apple.com/us/app/bonzai-your-local-ai-agent/id6752847988) (proprietary)
- [cztomsik/ava](https://github.com/cztomsik/ava) (MIT)
- [Dot](https://github.com/alexpinel/Dot) (GPL)
- [eva](https://github.com/ylsdamxssjxxdd/eva) (MIT)
- [iohub/collama](https://github.com/iohub/coLLaMA) (Apache-2.0)
- [janhq/jan](https://github.com/janhq/jan) (AGPL)
- [johnbean393/Sidekick](https://github.com/johnbean393/Sidekick) (MIT)
- [KanTV](https://github.com/zhouwg/kantv?tab=readme-ov-file) (Apache-2.0)
- [KodiBot](https://github.com/firatkiral/kodibot) (GPL)
- [llama.vim](https://github.com/ggml-org/llama.vim) (MIT)
- [LARS](https://github.com/abgulati/LARS) (AGPL)
- [Llama Assistant](https://github.com/vietanhdev/llama-assistant) (GPL)
- [LlamaLib](https://github.com/undreamai/LlamaLib) (Apache-2.0)
- [LLMFarm](https://github.com/guinmoon/LLMFarm?tab=readme-ov-file) (MIT)
- [LLMUnity](https://github.com/undreamai/LLMUnity) (MIT)
- [LMStudio](https://lmstudio.ai/) (proprietary)
- [LocalAI](https://github.com/mudler/LocalAI) (MIT)
- [LostRuins/koboldcpp](https://github.com/LostRuins/koboldcpp) (AGPL)
- [MindMac](https://mindmac.app) (proprietary)
- [MindWorkAI/AI-Studio](https://github.com/MindWorkAI/AI-Studio) (FSL-1.1-MIT)
- [Mobile-Artificial-Intelligence/maid](https://github.com/Mobile-Artificial-Intelligence/maid) (MIT)
- [Mozilla-Ocho/llamafile](https://github.com/Mozilla-Ocho/llamafile) (Apache-2.0)
- [nat/openplayground](https://github.com/nat/openplayground) (MIT)
- [nomic-ai/gpt4all](https://github.com/nomic-ai/gpt4all) (MIT)
- [ollama/ollama](https://github.com/ollama/ollama) (MIT)
- [oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui) (AGPL)
- [PocketPal AI](https://github.com/a-ghorbani/pocketpal-ai) (MIT)
- [psugihara/FreeChat](https://github.com/psugihara/FreeChat) (MIT)
- [ptsochantaris/emeltal](https://github.com/ptsochantaris/emeltal) (MIT)
- [pythops/tenere](https://github.com/pythops/tenere) (AGPL)
- [ramalama](https://github.com/containers/ramalama) (MIT)
- [semperai/amica](https://github.com/semperai/amica) (MIT)
- [withcatai/catai](https://github.com/withcatai/catai) (MIT)
- [Autopen](https://github.com/blackhole89/autopen) (GPL)

Tools

- [akx/ggify](https://github.com/akx/ggify) - download PyTorch models from the Hugging Face Hub and convert them to GGML
- [akx/ollama-dl](https://github.com/akx/ollama-dl) - download models from the Ollama library for direct use with llama.cpp
- [crashr/gppm](https://github.com/crashr/gppm) - launch llama.cpp instances on NVIDIA Tesla P40 or P100 GPUs with reduced idle power consumption
- [gpustack/gguf-parser](https://github.com/gpustack/gguf-parser-go/tree/main/cmd/gguf-parser) - inspect/check a GGUF file and estimate its memory usage
- [Styled Lines](https://marketplace.unity.com/packages/tools/generative-ai/styled-lines-llama-cpp-model-292902) (proprietary license; async inference wrapper for game development in Unity3D, with pre-built mobile and web platform wrappers and an example model)
- [unslothai/unsloth](https://github.com/unslothai/unsloth) - 🦥 export/save fine-tuned and trained models to GGUF (Apache-2.0)

Infrastructure

- [Paddler](https://github.com/intentee/paddler) - open-source LLMOps platform for hosting and scaling AI in your own infrastructure
- GPUStack (https://github.com/gpustack/gpus

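Similar articles

ggml-webgpu: Improve prefill speed for k-quants and refactor matrix multiplication for Q4/Q5/Q8 and k-quants by yomaytk · Pull Request #24225 · ggml-org/llama.cpp

Reddit r/LocalLLaMA

Improves prefill speed for k-quants and refactors matrix multiplication for Q4/Q5/Q8 and k-quants in the llama.cpp WebGPU backend.

[llama.cpp] Asymmetric KV q8/q4 cache: current caveats and discussion in the GGML repo

Reddit r/LocalLLaMA

Discusses caveats of using asymmetric KV cache quantization in llama.cpp, where mismatched q8/q4 types cause prompt processing to run on the CPU instead of the GPU, and proposes a fix via compile flags.

SYCL: Port multi-column MMVQ from the CUDA backend (about 45% speculative decoding speedup on Intel Arc) by masonmilby · Pull Request #21845 · ggml-org/llama.cpp

Reddit r/LocalLLaMA

A llama.cpp pull request that ports multi-column MMVQ from CUDA to SYCL, giving about a 45% speculative decoding speedup on Intel Arc GPUs.

ggml-org/llama.cpp

GitHub Trending (daily)

llama.cpp is an open-source C/C++ library for efficient LLM inference on local hardware, with support for multiple quantization methods and multiple backends (CPU, GPU, and others).

Tip: use this llama.cpp PR to speed up prompt processing on Intel ARC

Reddit r/LocalLLaMA

A llama.cpp PR significantly speeds up prompt processing on Intel ARC GPUs; benchmarks show an increase from 245 t/s to 462 t/s on the B580. The improvement currently applies only to F16 KV quantization, with support for other quantization types planned.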
