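ggml-cpu: Optimized x86 and generic CPU q1_0 dot product (follow-up) by pl752 · Pull Request #21636 · ggml-org/llama.cpp
Summary
This PR adds q1_0 dot-product kernels to ggml-cpu, optimized for x86 and for generic CPUs, to speed up inference with quantized LLMs.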
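Available since [b8858](https://github.com/ggml-org/llama.cpp/releases/tag/b8858). This is the optimized CPU build, and tokens per second (t/s) are now higher. (I just tested it on my ancient laptop with 16 GB of DDR3 RAM. Before the optimization: 0.3 t/s; after: 1.7 t/s. My machine evidently has neither AVX nor AVX512, so the gain is not as large as it could be. I will test again on a newer laptop this week.) By the way, the Metal, Vulkan, and CUDA versions are also supported (1-bit versions... Bonsai). Worth a look if you have not tried it yet.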
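To illustrate what a 1-bit dot-product kernel has to compute, here is a minimal reference sketch in Python. It is not the PR's code and does not reproduce ggml's actual q1_0 layout: the block size of 32, the `OneBitBlock` and `Int8Block` types, and both quantizer functions are assumptions made up for this example. It only shows the general idea that, with one sign bit per weight, a block's dot product reduces to integer sums plus one multiplication by the two scales.

```python
# Illustrative only: a HYPOTHETICAL 1-bit block layout, not ggml's real q1_0 format.
from dataclasses import dataclass
from typing import List

BLOCK = 32  # assumed number of weights per block


@dataclass
class OneBitBlock:
    scale: float  # per-block weight magnitude
    bits: int     # BLOCK sign bits: bit i set -> +scale, clear -> -scale


@dataclass
class Int8Block:
    scale: float  # per-block activation scale
    q: List[int]  # BLOCK signed 8-bit values


def quantize_1bit(w: List[float]) -> OneBitBlock:
    """Keep only the sign of each weight plus the block's mean magnitude."""
    assert len(w) == BLOCK
    scale = sum(abs(x) for x in w) / BLOCK
    bits = 0
    for i, x in enumerate(w):
        if x >= 0:
            bits |= 1 << i
    return OneBitBlock(scale, bits)


def quantize_int8(a: List[float]) -> Int8Block:
    """Symmetric 8-bit quantization of one block of activations."""
    assert len(a) == BLOCK
    amax = max(abs(x) for x in a)
    scale = amax / 127 if amax > 0 else 1.0
    return Int8Block(scale, [round(x / scale) for x in a])


def dot_block(wb: OneBitBlock, ab: Int8Block) -> float:
    # sum(sign_i * q_i) == 2 * sum(q_i where bit i is set) - sum(q_i),
    # so only integer additions are needed before the final scaling.
    total = sum(ab.q)
    pos = sum(q for i, q in enumerate(ab.q) if (wb.bits >> i) & 1)
    return wb.scale * ab.scale * (2 * pos - total)


def dot(wblocks: List[OneBitBlock], ablocks: List[Int8Block]) -> float:
    """Dot product of two equally long sequences of quantized blocks."""
    return sum(dot_block(w, a) for w, a in zip(wblocks, ablocks))
```

A vectorized kernel could apply the same identity to many lanes at once, which may be why the commenter expects a larger gain on a CPU with AVX; that is an interpretation, not something the PR summary states.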
Cached: 2026/04/21 14:55
ggml-org/llama.cpp
Source: https://github.com/ggml-org/llama.cpp

# llama.cpp

License: MIT (https://opensource.org/licenses/MIT) | Release (https://github.com/ggml-org/llama.cpp/releases) | Server (https://github.com/ggml-org/llama.cpp/actions/workflows/server.yml)

Manifesto (https://github.com/ggml-org/llama.cpp/discussions/205) / ggml (https://github.com/ggml-org/ggml) / ops (https://github.com/ggml-org/llama.cpp/blob/master/docs/ops.md)

LLM inference in C/C++

## Recent API changes

- Changelog for the libllama API (https://github.com/ggml-org/llama.cpp/issues/9289)
- Changelog for the llama-server REST API (https://github.com/ggml-org/llama.cpp/issues/9291)

## Hot topics

- Hugging Face cache migration: models downloaded with -hf are now stored in the standard HF cache directory and can be shared with other HF tools.
- Guide: using the new WebUI of llama.cpp (https://github.com/ggml-org/llama.cpp/discussions/16938)
- Guide: running gpt-oss with llama.cpp (https://github.com/ggml-org/llama.cpp/discussions/15396)
- [FEEDBACK] Better packaging for llama.cpp to support downstream consumers 🤗
- Support for gpt-oss models in the native MXFP4 format has been added | PR (https://github.com/ggml-org/llama.cpp/pull/15091) | Collaboration with NVIDIA (https://blogs.nvidia.com/blog/rtx-ai-garage-openai-oss) | Discussion (https://github.com/ggml-org/llama.cpp/discussions/15095)
- llama-server now supports multimodal input: #12898 (https://github.com/ggml-org/llama.cpp/pull/12898) | documentation
- VS Code extension for FIM completions: https://github.com/ggml-org/llama.vscode
- Vim/Neovim plugin for FIM completions: https://github.com/ggml-org/llama.vim
- Hugging Face Inference Endpoints now support GGUF out of the box! https://github.com/ggml-org/llama.cpp/discussions/9669
- Hugging Face GGUF editor: discussion (https://github.com/ggml-org/llama.cpp/discussions/9268) | tool (https://huggingface.co/spaces/CISCai/gguf-editor)

---

## Quick start

Getting started with llama.cpp is straightforward. Here are several ways to install it:

- Install llama.cpp using brew, nix, or winget
- Run with Docker; see the Docker documentation
- Download pre-built binaries from the releases page (https://github.com/ggml-org/llama.cpp/releases)
- Build from source by cloning the repository; see the build guide

Once installed, you will need a model to work with. Read the "Obtaining and quantizing models" section below to learn more.

Example commands:

    # Use a local model file
    llama-cli -m my_model.gguf

    # Or download and run a model directly from Hugging Face
    llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

    # Launch an OpenAI-compatible API server
    llama-server -hf ggml-org/gemma-3-1b-it-GGUF

## Description

The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware, locally and in the cloud.

- Plain C/C++ implementation without any dependencies
- Apple silicon is a first-class citizen: optimized via ARM NEON, Accelerate, and Metal
- AVX, AVX2, AVX512, and AMX support for x86 architectures
- RVV, ZVFH, ZFH, ZICBOP, and ZIHINTPAUSE support for RISC-V architectures
- 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
- Custom CUDA kernels for NVIDIA GPUs (AMD GPUs are supported via HIP, Moore Threads GPUs via MUSA)
- Vulkan and SYCL backend support
- CPU+GPU hybrid inference to partially accelerate models larger than the available VRAM

llama.cpp is also the main playground for developing new features for the ggml (https://github.com/ggml-org/ggml) library.

Models

Finetunes of the base models below are typically supported as well. Guide to adding support for new models: HOWTO-add-model.md

#### Text-only

- [x] LLaMA 🦙
- [x] LLaMA 2 🦙🦙
- [x] LLaMA 3 🦙🦙🦙
- [x] Mistral 7B (https://huggingface.co/mistralai/Mistral-7B-v0.1)
- [x] Mixtral MoE (https://huggingface.co/models?search=mistral-ai/Mixtral)
- [x] DBRX (https://huggingface.co/databricks/dbrx-instruct)
- [x] Jamba (https://huggingface.co/ai21labs)
- [x] Falcon (https://huggingface.co/models?search=tiiuae/falcon)
- [x] Chinese LLaMA / Alpaca (https://github.com/ymcui/Chinese-LLaMA-Alpaca) and Chinese LLaMA-2 / Alpaca-2 (https://github.com/ymcui/Chinese-LLaMA-Alpaca-2)
- [x] Vigogne (French) (https://github.com/bofenghuang/vigogne)
- [x] BERT (https://github.com/ggml-org/llama.cpp/pull/5423)
- [x] Koala (https://bair.berkeley.edu/blog/2023/04/03/koala/)
- [x] Baichuan 1 & 2 (https://huggingface.co/models?search=baichuan-inc/Baichuan) + derivations (https://huggingface.co/hiyouga/baichuan-7b-sft)
- [x] Aquila 1 & 2 (https://huggingface.co/models?search=BAAI/Aquila)
- [x] Starcoder models (https://github.com/ggml-org/llama.cpp/pull/3187)
- [x] Refact (https://huggingface.co/smallcloudai/Refact-1_6B-fim)
- [x] MPT (https://github.com/ggml-org/llama.cpp/pull/3417)
- [x] Bloom (https://github.com/ggml-org/llama.cpp/pull/3553)
- [x] Yi models (https://huggingface.co/models?search=01-ai/Yi)
- [x] StableLM models (https://huggingface.co/stabilityai)
- [x] Deepseek models (https://huggingface.co/models?search=deepseek-ai/deepseek)
- [x] Qwen models (https://huggingface.co/models?search=Qwen/Qwen)
- [x] PLaMo-13B (https://github.com/ggml-org/llama.cpp/pull/3557)
- [x] Phi models (https://huggingface.co/models?search=microsoft/phi)
- [x] PhiMoE (https://github.com/ggml-org/llama.cpp/pull/11003)
- [x] GPT-2 (https://huggingface.co/gpt2)
- [x] Orion 14B (https://github.com/ggml-org/llama.cpp/pull/5118)
- [x] InternLM2 (https://huggingface.co/models?search=internlm2)
- [x] CodeShell (https://github.com/WisdomShell/codeshell)
- [x] Gemma (https://ai.google.dev/gemma)
- [x] Mamba (https://github.com/state-spaces/mamba)
- [x] Grok-1 (https://huggingface.co/keyfan/grok-1-hf)
- [x] Xverse (https://huggingface.co/models?search=xverse)
- [x] Command-R models (https://huggingface.co/models?search=CohereForAI/c4ai-command-r)
- [x] SEA-LION (https://huggingface.co/models?search=sea-lion)
- [x] GritLM-7B (https://huggingface.co/GritLM/GritLM-7B) + GritLM-8x7B (https://huggingface.co/GritLM/GritLM-8x7B)
- [x] OLMo (https://allenai.org/olmo)
- [x] OLMo 2 (https://allenai.org/olmo)
- [x] OLMoE (https://huggingface.co/allenai/OLMoE-1B-7B-0924)
- [x] Granite models (https://huggingface.co/collections/ibm-granite/granite-code-models-6624c5cec322e4c148c8b330)
- [x] GPT-NeoX (https://github.com/EleutherAI/gpt-neox) + Pythia (https://github.com/EleutherAI/pythia)
- [x] Snowflake-Arctic MoE (https://huggingface.co/collections/Snowflake/arctic-66290090abe542894a5ac520)
- [x] Smaug (https://huggingface.co/models?search=Smaug)
- [x] Poro 34B (https://huggingface.co/LumiOpen/Poro-34B)
- [x] Bitnet b1.58 models (https://huggingface.co/1bitLLM)
- [x] Flan T5 (https://huggingface.co/models?search=flan-t5)
- [x] Open Elm models (https://huggingface.co/collections/apple/openelm-instruct-models-6619ad295d7ae9f868b759ca)
- [x] ChatGLM3-6b (https://huggingface.co/THUDM/chatglm3-6b) + ChatGLM4-9b (https://huggingface.co/THUDM/glm-4-9b) + GLMEdge-1.5b (https://huggingface.co/THUDM/glm-edge-1.5b-chat) + GLMEdge-4b (https://huggingface.co/THUDM/glm-edge-4b-chat)
- [x] GLM-4-0414 (https://huggingface.co/collections/THUDM/glm-4-0414-67f3cbcb34dd9d252707cb2e)
- [x] SmolLM (https://huggingface.co/collections/HuggingFaceTB/smollm-6695016cad7167254ce15966)
- [x] EXAONE-3.0-7.8B-Instruct (https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct)
- [x] FalconMamba models (https://huggingface.co/collections/tiiuae/falconmamba-7b-66b9a580324dd1598b0f6d4a)
- [x] Jais (https://huggingface.co/inceptionai/jais-13b-chat)
- [x] Bielik-11B-v2.3 (https://huggingface.co/collections/speakleash/bielik-11b-v23-66ee813238d9b526a072408a)
- [x] RWKV-7 (https://huggingface.co/collections/shoumenchougou/rwkv7-gxx-gguf)
- [x] RWKV-6 (https://github.com/BlinkDL/RWKV-LM)
- [x] QRWKV-6 (https://huggingface.co/recursal/QRWKV6-32B-Instruct-Preview-v0.1)
- [x] GigaChat-20B-A3B (https://huggingface.co/ai-sage/GigaChat-20B-A3B-instruct)
- [x] Trillion-7B-preview (https://huggingface.co/trillionlabs/Trillion-7B-preview)
- [x] Ling models (https://huggingface.co/collections/inclusionAI/ling-67c51c85b34a7ea0aba94c32)
- [x] LFM2 models (https://huggingface.co/collections/LiquidAI/lfm2-686d721927015b2ad73eaa38)
- [x] Hunyuan models (https://huggingface.co/collections/tencent/hunyuan-dense-model-6890632cda26b19119c9c5e7)
- [x] BailingMoeV2 (Ring/Ling 2.0) models (https://huggingface.co/collections/inclusionAI/ling-v2-68bf1dd2fc34c306c1fa6f86)

#### Multimodal

- [x] LLaVA 1.5 models (https://huggingface.co/collections/liuhaotian/llava-15-653aac15d994e992e2677a7e), LLaVA 1.6 models (https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2)
- [x] BakLLaVA (https://huggingface.co/models?search=SkunkworksAI/Bakllava)
- [x] Obsidian (https://huggingface.co/NousResearch/Obsidian-3B-V0.5)
- [x] ShareGPT4V (https://huggingface.co/models?search=Lin-Chen/ShareGPT4V)
- [x] MobileVLM 1.7B/3B models (https://huggingface.co/models?search=mobileVLM)
- [x] Yi-VL (https://huggingface.co/models?search=Yi-VL)
- [x] Mini CPM (https://huggingface.co/models?search=MiniCPM)
- [x] Moondream (https://huggingface.co/vikhyatk/moondream2)
- [x] Bunny (https://github.com/BAAI-DCAI/Bunny)
- [x] GLM-EDGE (https://huggingface.co/models?search=glm-edge)
- [x] Qwen2-VL (https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d)
- [x] LFM2-VL (https://huggingface.co/collections/LiquidAI/lfm2-vl-68963bbc84a610f7638d5ffa)

Bindings

- Python: ddh0/easy-llama (https://github.com/ddh0/easy-llama)
- Python: abetlen/llama-cpp-python (https://github.com/abetlen/llama-cpp-python)
- Go: go-skynet/go-llama.cpp (https://github.com/go-skynet/go-llama.cpp)
- Node.js: withcatai/node-llama-cpp (https://github.com/withcatai/node-llama-cpp)
- JS/TS (llama.cpp server client): lgrammel/modelfusion (https://modelfusion.dev/integration/model-provider/llamacpp)
- JS/TS (programmable prompt engine CLI): offline-ai/cli (https://github.com/offline-ai/cli)
- JavaScript/Wasm (works in the browser): tangledgroup/llama-cpp-wasm (https://github.com/tangledgroup/llama-cpp-wasm)
- TypeScript/Wasm (nicer API, available on npm): ngxson/wllama (https://github.com/ngxson/wllama)
- Ruby: yoshoku/llama_cpp.rb (https://github.com/yoshoku/llama_cpp.rb)
- Rust (more features): edgenai/llama_cpp-rs (https://github.com/edgenai/llama_cpp-rs)
- Rust (nicer API): mdrokz/rust-llama.cpp (https://github.com/mdrokz/rust-llama.cpp)
- Rust (more direct bindings): utilityai/llama-cpp-rs (https://github.com/utilityai/llama-cpp-rs)
- Rust (automated build from crates.io): ShelbyJenkins/llm_client (https://github.com/ShelbyJenkins/llm_client)
- C#/.NET: SciSharp/LLamaSharp (https://github.com/SciSharp/LLamaSharp)
- C#/VB.NET (more features, community license): LM-Kit.NET (https://docs.lm-kit.com/lm-kit-net/index.html)
- Scala 3: donderom/llm4s (https://github.com/donderom/llm4s)
- Clojure: phronmophobic/llama.clj (https://github.com/phronmophobic/llama.clj)
- React Native: mybigday/llama.rn (https://github.com/mybigday/llama.rn)
- Java: kherud/java-llama.cpp (https://github.com/kherud/java-llama.cpp)
- Java: QuasarByte/llama-cpp-jna (https://github.com/QuasarByte/llama-cpp-jna)
- Zig: deins/llama.cpp.zig (https://github.com/Deins/llama.cpp.zig)
- Flutter/Dart: netdur/llama_cpp_dart (https://github.com/netdur/llama_cpp_dart)
- Flutter: xuegao-tzx/Fllama (https://github.com/xuegao-tzx/Fllama)
- PHP (API bindings and features built on top of llama.cpp): distantmagic/resonance (https://github.com/distantmagic/resonance) (more info) (https://github.com/ggml-org/llama.cpp/pull/6326)
- Guile Scheme: guile_llama_cpp (https://savannah.nongnu.org/projects/guile-llama-cpp)
- Swift: srgtuszy/llama-cpp-swift (https://github.com/srgtuszy/llama-cpp-swift)
- Swift: ShenghaiWang/SwiftLlama (https://github.com/ShenghaiWang/SwiftLlama)
- Delphi: Embarcadero/llama-cpp-delphi (https://github.com/Embarcadero/llama-cpp-delphi)
- Go (no CGo needed): hybridgroup/yzma (https://github.com/hybridgroup/yzma)
- Android: llama.android

UIs

(To be listed here, a project must clearly state that it depends on llama.cpp.)

- AI Sublime Text plugin (https://github.com/yaroslavyaroslav/OpenAI-sublime-text) (MIT)
- BonzAI App (https://apps.apple.com/us/app/bonzai-your-local-ai-agent/id6752847988) (proprietary)
- cztomsik/ava (https://github.com/cztomsik/ava) (MIT)
- Dot (https://github.com/alexpinel/Dot) (GPL)
- eva (https://github.com/ylsdamxssjxxdd/eva) (MIT)
- iohub/collama (https://github.com/iohub/coLLaMA) (Apache-2.0)
- janhq/jan (https://github.com/janhq/jan) (AGPL)
- johnbean393/Sidekick (https://github.com/johnbean393/Sidekick) (MIT)
- KanTV (https://github.com/zhouwg/kantv?tab=readme-ov-file) (Apache-2.0)
- KodiBot (https://github.com/firatkiral/kodibot) (GPL)
- llama.vim (https://github.com/ggml-org/llama.vim) (MIT)
- LARS (https://github.com/abgulati/LARS) (AGPL)
- Llama Assistant (https://github.com/vietanhdev/llama-assistant) (GPL)
- LlamaLib (https://github.com/undreamai/LlamaLib) (Apache-2.0)
- LLMFarm (https://github.com/guinmoon/LLMFarm?tab=readme-ov-file) (MIT)
- LLMUnity (https://github.com/undreamai/LLMUnity) (MIT)
- LMStudio (https://lmstudio.ai/) (proprietary)
- LocalAI (https://github.com/mudler/LocalAI) (MIT)
- LostRuins/koboldcpp (https://github.com/LostRuins/koboldcpp) (AGPL)
- MindMac (https://mindmac.app) (proprietary)
- MindWorkAI/AI-Studio (https://github.com/MindWorkAI/AI-Studio) (FSL-1.1-MIT)
- Mobile-Artificial-Intelligence/maid (https://github.com/Mobile-Artificial-Intelligence/maid) (MIT)
- Mozilla-Ocho/llamafile (https://github.com/Mozilla-Ocho/llamafile) (Apache-2.0)
- nat/openplayground (https://github.com/nat/openplayground) (MIT)
- nomic-ai/gpt4all (https://github.com/nomic-ai/gpt4all) (MIT)
- ollama/ollama (https://github.com/ollama/ollama) (MIT)
- oobabooga/text-generation-webui (https://github.com/oobabooga/text-generation-webui) (AGPL)
- PocketPal AI (https://github.com/a-ghorbani/pocketpal-ai) (MIT)
- psugihara/FreeChat (https://github.com/psugihara/FreeChat) (MIT)
- ptsochantaris/emeltal (https://github.com/ptsochantaris/emeltal) (MIT)
- pythops/tenere (https://github.com/pythops/tenere) (AGPL)
- ramalama (https://github.com/containers/ramalama) (MIT)
- semperai/amica (https://github.com/semperai/amica) (MIT)
- withcatai/catai (https://github.com/withcatai/catai) (MIT)
- Autopen (https://github.com/blackhole89/autopen) (GPL)

Tools

- akx/ggify (https://github.com/akx/ggify): download PyTorch models from the Hugging Face Hub and convert them to GGML
- akx/ollama-dl (https://github.com/akx/ollama-dl): download models from the Ollama library for direct use with llama.cpp
- crashr/gppm (https://github.com/crashr/gppm): launch llama.cpp instances on NVIDIA Tesla P40 or P100 GPUs with reduced idle power consumption
- gpustack/gguf-parser (https://github.com/gpustack/gguf-parser-go/tree/main/cmd/gguf-parser): inspect and validate GGUF files and estimate memory usage
- Styled Lines (https://marketplace.unity.com/packages/tools/generative-ai/styled-lines-llama-cpp-model-292902) (proprietary license; asynchronous inference wrapper for game development in Unity3D, with pre-built mobile and web platform wrappers and an example model)
- unslothai/unsloth (https://github.com/unslothai/unsloth): 🦥 export and save fine-tuned and trained models to GGUF (Apache-2.0)

Infrastructure

- Paddler (https://github.com/intentee/paddler): open-source LLMOps platform for hosting and scaling AI on your own infrastructure
- GPUStack (https://github.com/gpustack/gpus
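Similar articles
ggml-webgpu: Improve prefill speed for k-quants and refactor matrix multiplication for Q4/Q5/Q8 and k-quants by yomaytk · Pull Request #24225 · ggml-org/llama.cpp
Improves prefill speed for k-quants and refactors matrix multiplication for Q4/Q5/Q8 and k-quants in the llama.cpp WebGPU backend.
[llama.cpp] Asymmetric KV q8/q4 cache: current caveats and the discussion in the GGML repository
Discusses caveats of asymmetric KV cache quantization in llama.cpp, where mismatched q8/q4 types cause prompt processing to run on the CPU instead of the GPU, and proposes a fix via a compile flag.
SYCL: Port multi-column MMVQ from the CUDA backend (about 45% speculative decoding speedup on Intel Arc) by masonmilby · Pull Request #21845 · ggml-org/llama.cpp
A llama.cpp pull request that ports multi-column MMVQ from CUDA to SYCL, yielding about a 45% speculative decoding speedup on Intel Arc GPUs.
ggml-org/llama.cpp
llama.cpp is an open-source C/C++ library for running LLM inference efficiently on local hardware, with support for multiple quantization methods and multiple backends (CPU, GPU, and others).
Tip: use this llama.cpp PR to improve prompt processing speed on Intel ARC
A llama.cpp PR significantly improves prompt processing speed on Intel ARC GPUs; benchmarks show an increase from 245 t/s to 462 t/s on the B580. The improvement currently applies only to F16 KV quantization, with support for other quantization types planned.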