A developer has implemented a proof-of-concept PR for llama.cpp that adds dynamic KV cache quantization via an HTTP endpoint, letting users requantize their KV cache on demand without fully reloading the model. The post also outlines a wishlist that includes load-on-demand mmproj/MTP swapping and an automatic --fit flag for context optimization.
A user discovered that a hidden PCIe 2.0 x4 electrical limitation on a Threadripper workstation board was crippling one of four RTX 3090s, causing poor multi-GPU LLM inference performance. Fixing the slot layout and switching to tensor split mode more than doubled Mistral 128B throughput, from ~11 to ~24.7 tok/s.
llama.cpp maintainers and NVIDIA engineers collaborated to significantly improve multi-GPU performance in ggml, enabling hardware-agnostic tensor parallelism and major performance gains on RTX systems.
A user tested Gemma 4 12B as a coding agent in VSCodium using the Pi Agent extension. The task was to create a Python script that reads logs and outputs JSON; the model handled tool use autonomously and produced zero bugs.
llama.cpp release b9495 adds optimizations for Qwen3.6/3.5-MTP (Multi-Token Prediction) and asks users to share their benchmark results with full command details.
A user shares their experience running the GGUF Q4_K_M quantization of Gemma-4-12b with llama.cpp on a Mac, reporting local inference at about 36 tok/s with about 10 GB of memory use.
The llama.cpp web UI adds support for rendering Mermaid diagrams in chat, along with an interactive preview.
A developer built a Tauri v2 desktop chat shell for local LLMs that can connect to Ollama, llama.cpp, or any OpenAI-compatible endpoint. The project is MIT licensed and produces a ~12 MB binary.
A user reports a CUDA error when running the Qwen-3.6-27b model in tensor split mode with the latest llama.cpp on dual RTX 3090s, under Ubuntu Server 24.04 and Docker.
A pull request adds support for the StepFun 3.5 MTP model in llama.cpp.
A deep benchmark of 8 tiny LLMs (135M to 1B parameters) on a $250 Jetson Orin Nano Super across four power modes finds the 25W mode to be Pareto-optimal, with SmolLM2-135M achieving 165.1 tok/s and the best efficiency.
Benchmark results for the Intel Arc Pro B70 GPU running llama.cpp with SYCL on Qwen models show 63 tok/s.
A user shares their positive experience using Qwen 3.6 27B locally for complex research and coding, finding it outperforms Gemini Pro in career advice and immigration research, while also noting performance issues with Gemma 4 31B.
A blog post details how to run the Gemma 4 AI model on a 10-year-old Xeon server using only the CPU and DDR3 RAM, with customized llama.cpp optimizations.
A user shares their experience of adding an old NVIDIA 2070 Super GPU to their rig for extra VRAM, which lets them run larger LLMs like Qwen3.6-27B at high quantization and context size with good performance. They are now considering an upgrade to a 3090 for even more VRAM.
A user's benchmarks show no significant speed difference between Windows 11 and Linux when running large MoE models with llama.cpp, debunking a common myth. Tests on a multi-GPU setup with models such as Qwen 3.5 122B, 397B, and MiniMax 2.7 yield nearly identical prompt processing and token generation speeds.
Mudler released APEX-MTP GGUF quantizations of the Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled model, bundling the multi-token prediction head for self-speculative decoding with llama.cpp.
The MINISFORUM UM790 Pro is highlighted as a budget mini PC for local AI inference using llama.cpp and vLLM.
llama.cpp announces a new website and a unified 'llama' binary for simpler LLM inference, along with updates such as Hugging Face cache migration and multimodal support.
llama.cpp release b9406 fixes a crash (GGML_ASSERT) when using MTP with MoE vision models like Qwen3.6-35B-A3B.