llama-cpp

#llama-cpp

unsloth/North-Mini-Code-1.0-GGUF · Hugging Face

Reddit r/LocalLLaMA ↗ · 2026-06-10 Cached

This page hosts GGUF quantized versions of Cohere's North-Mini-Code-1.0 model, a 30B-A3B MoE model optimized for code generation and agentic tasks. Instructions are provided for building llama.cpp from a specific PR to support the cohere2moe architecture.

Here's a llama.cpp CLI Command builder.

Reddit r/LocalLLaMA ↗ · 2026-06-09 Cached

A static Linux command builder for llama.cpp that helps construct CLI commands, run benchmarks, and log results.

Pipeline parallelism in llama.cpp may be wasting your VRAM

Reddit r/LocalLLaMA ↗ · 2026-06-08

Testing shows that default pipeline parallelism in llama.cpp wastes VRAM with no speed benefit; compiling with GGML_SCHED_MAX_COPIES=1 saves significant VRAM while maintaining identical inference speed.

[3090] Gemma4 QAT + MTP quick TPS numbers [TLDR 1.2-1.8x better]

Reddit r/LocalLLaMA ↗ · 2026-06-08

Benchmark results showing 1.2-1.8x tokens-per-second speedups on Gemma 4 models (12B and 26B) using QAT and MTP on a 24GB RTX 3090 GPU.

mtmd : add video input support by ngxson · Pull Request #24269 · ggml-org/llama.cpp

Reddit r/LocalLLaMA ↗ · 2026-06-08 Cached

This pull request adds video input support to llama.cpp, enabling multimodal models to process video data via the new mtmd component.

kv-cache : avoid kv cells copies by ggerganov · Pull Request #24277 · ggml-org/llama.cpp

Reddit r/LocalLLaMA ↗ · 2026-06-08 Cached

This pull request by ggerganov changes the llama.cpp KV cache to avoid unnecessary copies of KV cells, improving inference performance.

@leopardracer: SAME GPU SAME MODEL SAME CONTEXT AND 2X THE SPEED rtx 4060, gemma 4 12b, 48k context just switched the quantization fro…

X AI KOLs Timeline ↗ · 2026-06-08 Cached

Changing quantization from q4_k_m to q4_k_xl in llama.cpp doubles inference speed on the same GPU without hardware or driver changes, as demonstrated with Gemma 4 12B on an RTX 4060.

@steeve: aaaaaand we're faster (i know i know)

X AI KOLs Following ↗ · 2026-06-08 Cached

Steeve Morin reports that after 5 days of work, his implementation is now within 10% of llama.cpp's speed (64 tok/s versus llama.cpp's 70 tok/s), with more work to do.

MTP and QAT - what is the relation?

Reddit r/LocalLLaMA ↗ · 2026-06-07

A user seeks clarification on the relation between MTP (Multi-Token Prediction) and QAT (Quantization-Aware Training) in llama.cpp, particularly regarding GGUF compatibility for the Gemma4 model and the new QAT string in filenames.

@osanseviero: Gemma 4 MTP just got officially merged into llama.cpp This means you can use Gemma 4 QAT + MTP for a lightweight + supe…

X AI KOLs Following ↗ · 2026-06-07 Cached

Gemma 4 MTP has been merged into llama.cpp, enabling lightweight and fast inference with Gemma 4 QAT and MTP.

@analogalok: Run Gemma 4 26B MoE on 8GB VRAM with 250k context at 20+ tokens/sec If you own any 8GB VRAM graphics card, stop what yo…

X AI KOLs Timeline ↗ · 2026-06-07 Cached

Alok demonstrates running Gemma 4 26B MoE on 8GB of VRAM using Unsloth's QAT quant and the -cmoe flag in llama.cpp, reporting 20+ tokens/sec with 250k context.

120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP

Reddit r/LocalLLaMA ↗ · 2026-06-06

Google's Gemma 4 12B QAT model achieves 120 tok/s on a 12GB GPU using Multi-Token Prediction (MTP) with llama.cpp. The post includes a step-by-step guide and a benchmark comparison showing a 2x speedup over running without MTP.

sycl : port multi-column MMVQ from CUDA backend (~45% speculative decoding speedup on Intel Arc) by masonmilby · Pull Request #21845 · ggml-org/llama.cpp

Reddit r/LocalLLaMA ↗ · 2026-06-05 Cached

A pull request for llama.cpp ports multi-column MMVQ from CUDA to SYCL, achieving approximately 45% speculative decoding speedup on Intel Arc GPUs.

PSA: Gemma 4 12B is NOT completely broken for coding and tool calling, you need a special chat template

Reddit r/LocalLLaMA ↗ · 2026-06-05

Gemma 4 12B has a known issue with tool calling and coding, but using a custom chat template in llama.cpp resolves the bugs. Users should compile llama.cpp from source and apply the fix before evaluating the model's coding ability.

I built an iOS app to benchmark GGUF models on your iPhone/iPad

Reddit r/LocalLLaMA ↗ · 2026-06-05

GenBench is a free iOS app that lets users download, run, and benchmark GGUF models on iPhone/iPad using llama.cpp and Metal, with features like offline chat, standardized benchmarks, and a global leaderboard.

Maybe KV cache offload to RAM isn't bad

Reddit r/LocalLLaMA ↗ · 2026-06-05

A user shares their experience offloading the KV cache to RAM in llama.cpp, achieving comparable speeds while freeing VRAM for larger models and context windows, suggesting this trade-off is often worthwhile.

model: Granite4 Vision by gabe-l-hart · Pull Request #23545 · ggml-org/llama.cpp

Reddit r/LocalLLaMA ↗ · 2026-06-05 Cached

This pull request adds support for the Granite4 Vision model to llama.cpp, an open-source LLM inference engine.

Qwen 3.5 122B MoE OC on a single 3090 at 35 t/s — full local stack breakdown

Reddit r/openclaw ↗ · 2026-06-05

Detailed breakdown of running Qwen 3.5 122B MoE on a single RTX 3090 at 35 t/s using a custom llama.cpp fork (ik_llama.cpp) with fused MoE operations and expert offloading to CPU RAM, significantly outperforming stock llama.cpp MTP.

RTX Pro 4500 Blackwell Performance Numbers

Reddit r/LocalLLaMA ↗ · 2026-06-05

A user shares performance benchmarks comparing the Nvidia RTX Pro 4500 Blackwell 32GB GPU against the RTX 5060 Ti 16GB for AI inference, showing 1.6-6x speed improvements depending on model size and quantization.

Here is my llama.cpp NVFP4/MXFP6 GGUF quantizer tool

Reddit r/LocalLLaMA ↗ · 2026-06-05

The author introduces an open-source GGUF quantizer tool for llama.cpp that creates NVFP4 and MXFP6 quantized models with advanced techniques like RSF, tensor promotion, and dynamic quantization, achieving better quality than existing methods like ModelOpt.
