This page hosts GGUF quantized versions of Cohere's North-Mini-Code-1.0 model, a 30B-A3B MoE model optimized for code generation and agentic tasks. Instructions are provided for building llama.cpp from a specific PR to support the cohere2moe architecture.
A static Linux command builder for llama.cpp that helps construct CLI commands, run benchmarks, and log results.
Testing shows that default pipeline parallelism in llama.cpp wastes VRAM with no speed benefit; compiling with GGML_SCHED_MAX_COPIES=1 saves significant VRAM while maintaining identical inference speed.
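As a rough illustration of the build setting mentioned above, the sketch below assembles a CMake configure command that sets GGML_SCHED_MAX_COPIES=1. The helper name `cmake_configure_command`, the `build` directory, and the extra `GGML_CUDA=ON` define are assumptions made for this example and do not come from the original post. The script only prints the command; it does not run it.

```python
import shlex


def cmake_configure_command(build_dir="build", sched_max_copies=1, extra_defines=None):
    """Return a CMake configure command as a list of arguments.

    Hypothetical helper for illustration; only GGML_SCHED_MAX_COPIES=1
    is taken from the summarized post.
    """
    defines = {"GGML_SCHED_MAX_COPIES": str(sched_max_copies)}
    if extra_defines:
        defines.update(extra_defines)
    cmd = ["cmake", "-B", build_dir]
    # Each define becomes a -DNAME=VALUE argument.
    cmd += [f"-D{name}={value}" for name, value in defines.items()]
    return cmd


# Assumption: a CUDA build; drop or change this define for other backends.
configure_cmd = cmake_configure_command(extra_defines={"GGML_CUDA": "ON"})
print(shlex.join(configure_cmd))
```

The printed command can be reviewed and then run manually from the llama.cpp source directory, followed by the usual build step.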
Benchmark results show 1.2-1.8x tokens-per-second speedups on Gemma 4 models (12B and 26B) using QAT and MTP on a 24GB RTX 3090 GPU.
This pull request adds video input support to llama.cpp, enabling multimodal models to process video data via the new mtmd component.
This pull request by ggerganov optimizes the KV cache in llama.cpp to avoid unnecessary copies of KV cells, improving inference performance.
Changing quantization from q4_k_m to q4_k_xl in llama.cpp doubles inference speed on the same GPU without hardware or driver changes, as demonstrated with Gemma 4 12B on an RTX 4060.
Steeve Morin reports that after 5 days of work, his implementation is now within 10% of llama.cpp's speed, achieving 64 tok/s vs 70 tok/s, with more work to do.
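The "within 10%" figure can be checked with a short calculation. This is a minimal sketch; the function name `relative_gap` is hypothetical, and the only inputs are the two throughput numbers quoted in the summary.

```python
def relative_gap(candidate_tps, reference_tps):
    """Fraction by which the candidate throughput trails the reference."""
    return (reference_tps - candidate_tps) / reference_tps


# 64 tok/s for the new implementation vs 70 tok/s for llama.cpp.
gap = relative_gap(64.0, 70.0)
print(f"{gap:.1%}")  # prints 8.6%
```

A gap of roughly 8.6% is consistent with the "within 10%" claim.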
A user seeks clarification on the relationship between MTP (Multi-Token Prediction) and QAT (Quantization-Aware Training) in llama.cpp, particularly regarding GGUF compatibility for the Gemma 4 model and the new QAT string in filenames.
Gemma 4 MTP support has been merged into llama.cpp, enabling fast, lightweight inference when combined with Gemma 4 QAT models.
Alok demonstrates running Gemma 4 26B MoE on 8GB VRAM using Unsloth's QAT quant and the -cmoe flag in llama.cpp, achieving 20 tokens/sec with 250k context, marking a major milestone for budget local AI.
Google's Gemma 4 12B QAT model achieves 120 tok/s on a 12GB GPU using Multi-Token Prediction (MTP) with llama.cpp. The post includes a step-by-step guide, and its benchmark comparison against running without MTP shows a 2x speedup.
A pull request for llama.cpp ports multi-column MMVQ from CUDA to SYCL, achieving approximately 45% speculative decoding speedup on Intel Arc GPUs.
Gemma 4 12B has a known issue with tool calling and coding, but using a custom chat template in llama.cpp resolves the bugs. Users should compile llama.cpp from source and apply the fix before evaluating the model's coding ability.
GenBench is a free iOS app that lets users download, run, and benchmark GGUF models on iPhone/iPad using llama.cpp and Metal, with features like offline chat, standardized benchmarks, and a global leaderboard.
A user shares their experience offloading the KV cache to RAM in llama.cpp, achieving comparable speeds while freeing VRAM for larger models and context windows, suggesting this trade-off is often worthwhile.
This pull request adds support for the Granite4 Vision model to llama.cpp, an open-source LLM inference engine.
This post gives a detailed breakdown of running Qwen 3.5 122B MoE on a single RTX 3090 at 35 t/s using a custom llama.cpp fork (ik_llama.cpp) with fused MoE operations and expert offloading to CPU RAM, significantly outperforming stock llama.cpp MTP.
A user shares performance benchmarks comparing the Nvidia RTX Pro 4500 Blackwell 32GB GPU against the RTX 5060 Ti 16GB for AI inference, showing 1.6-6x speed improvements depending on model size and quantization.
The author introduces an open-source GGUF quantizer tool for llama.cpp that creates NVFP4 and MXFP6 quantized models with advanced techniques like RSF, tensor promotion, and dynamic quantization, achieving better quality than existing methods like ModelOpt.