llama-cpp

#llama-cpp

Dynamic KV Cache Quantization and Load-on-demand mmproj/MTP: my llama.cpp wishlist

Reddit r/LocalLLaMA ↗ · 2026-06-04

A developer has implemented a proof-of-concept PR for llama.cpp that adds dynamic KV cache quantization via an HTTP endpoint, letting users requantize the KV cache on demand without fully reloading the model. The post also outlines a wishlist that includes load-on-demand mmproj/MTP swapping and an automatic --fit flag for context optimization.


I accidentally crippled my 4x RTX 3090 LLM rig with a hidden PCIe 2.0 x4 slot and fixing it doubled Mistral 128B performance

Reddit r/LocalLLaMA ↗ · 2026-06-04

A user discovered that a hidden PCIe 2.0 x4 electrical limitation on a Threadripper workstation board was crippling one of four RTX 3090s, causing poor multi-GPU LLM inference performance. Fixing the slot layout and switching to tensor split mode doubled Mistral 128B throughput from ~11 to ~24.7 tok/s.


@ggerganov: Highlighting recent advances in multi-GPU and tensor parallel support in llama.cpp Over the last few months llama.cpp m…

X AI KOLs Following ↗ · 2026-06-04

llama.cpp maintainers and NVIDIA engineers collaborated to significantly improve multi-GPU performance in ggml, enabling hardware-agnostic tensor parallelism and major performance gains on RTX systems.


Gemma 4 12B first coding agent test on a 4080 Super

Reddit r/LocalLLaMA ↗ · 2026-06-03

A user tested Gemma 4 12B as a coding agent in VSCodium using the Pi Agent extension. The model completed a task to create a Python script that reads logs and outputs JSON, handling tool use autonomously with zero bugs.


llama.cpp - Qwen3.6/3.5-MTP - Share your benchmarks t/s

Reddit r/LocalLLaMA ↗ · 2026-06-03

llama.cpp release b9495 includes optimizations for Qwen3.6/3.5-MTP (Multi-Token Prediction). The post asks users to share their benchmark results along with the full command details.


@mylifcc: I'm already running Gemma-4-12b on my Mac. Tech stack: llama.cpp + GGUF Q4_K_M + Metal 32K context, local OpenAI-compatible API. Measured about 36 tok/s, resident RSS about…

X AI KOLs Timeline ↗ · 2026-06-03

A user reports running the GGUF Q4_K_M quantization of Gemma-4-12b on a Mac with llama.cpp, measuring about 36 tok/s for local inference and about 10 GB of memory use.


ui: Mermaid Diagrams in chat + interactive preview by allozaur · Pull Request #24032 · ggml-org/llama.cpp

Reddit r/LocalLLaMA ↗ · 2026-06-03

Adds support for rendering Mermaid diagrams in chat and an interactive preview within the llama.cpp web UI.


Built a Tauri v2 desktop chat shell for local LLMs — point it at Ollama / llama.cpp / any OpenAI-compatible endpoint, MIT, ~12 MB binary

Reddit r/LocalLLaMA ↗ · 2026-06-03

A developer built a Tauri v2 desktop chat shell for local LLMs that connects to Ollama, llama.cpp, or any OpenAI-compatible endpoint. The project is MIT-licensed and ships as a ~12 MB binary.


Tensor split mode: CUDA error on latest llama.cpp with Qwen-3.6-27b

Reddit r/LocalLLaMA ↗ · 2026-06-03

A user reports a CUDA error when using tensor split mode with the latest llama.cpp and the Qwen-3.6-27b model on dual RTX 3090s, running Ubuntu Server 24.04 and Docker.


StepFun 3.5 MTP by pwilkin · Pull Request #23274 · ggml-org/llama.cpp

Reddit r/LocalLLaMA ↗ · 2026-06-02

A pull request that adds support for the StepFun 3.5 MTP model to llama.cpp.


Tiny LLM Benchmark: Jetson Orin Nano Super 8GB - Four Power Modes × Eight Models

Reddit r/LocalLLaMA ↗ · 2026-06-02

A deep benchmark of 8 tiny LLMs (135M to 1B parameters) on a $250 Jetson Orin Nano Super across four power modes finds 25W to be Pareto-optimal, with SmolLM2-135M achieving 165.1 tok/s and best efficiency.


Intel Arc Pro B70 llama.cpp benchmarks posted

Reddit r/LocalLLaMA ↗ · 2026-06-02

Benchmark results for the Intel Arc Pro B70 GPU running llama.cpp with the SYCL backend on Qwen models show 63 tokens per second.


Qwen 3.6 27B kick balls

Reddit r/LocalLLaMA ↗ · 2026-06-01

A user shares their positive experience using Qwen 3.6 27B locally for complex research and coding, finding it outperforms Gemini Pro in career advice and immigration research, while also noting performance issues with Gemma 4 31B.


A 10 year old Xeon is all you need

Hacker News Top ↗ · 2026-06-01

A blog post detailing how to run the Gemma 4 AI model on a 10-year-old Xeon server with only CPU and DDR3 RAM, using customized llama.cpp optimizations.


Added an old 2070 Super to my rig and I can't go back...worse, now I need more

Reddit r/LocalLLaMA ↗ · 2026-05-31

A user added an old NVIDIA 2070 Super GPU to their rig for extra VRAM, which let them run larger LLMs such as Qwen3.6-27B at high quantization and context size with good performance. They are now considering an upgrade to a 3090 for even more VRAM.


Speed difference between Windows 11 and Linux with llama.cpp: a myth when using medium and large MoE models

Reddit r/LocalLLaMA ↗ · 2026-05-31

User benchmarks show no significant speed difference between Windows 11 and Linux when running large MoE models with llama.cpp, debunking a common myth. Tests on a multi-GPU setup with models like Qwen 3.5 122B, 397B, and MiniMax 2.7 yield nearly identical prompt processing and token generation speeds.


mudler/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-APEX-MTP-GGUF just released !

Reddit r/LocalLLaMA ↗ · 2026-05-31

Mudler released APEX-MTP GGUF quantizations of the Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled model, bundling the multi-token prediction head for self-speculative decoding with llama.cpp.


MINISFORUM UM790 Pro

Reddit r/LocalLLaMA ↗ · 2026-05-30

The MINISFORUM UM790 Pro is highlighted as a budget mini PC for local AI inference using llama.cpp and vLLM.


llama : website + unified `llama` binary · ggml-org/llama.cpp · Discussion #23875

Reddit r/LocalLLaMA ↗ · 2026-05-29

The llama.cpp project announces a new website and a unified `llama` binary for simpler LLM inference, along with updates such as Hugging Face cache migration and multimodal support.


Llama.cpp B9406 MTP mmproj fix

Reddit r/LocalLLaMA ↗ · 2026-05-29

llama.cpp release B9406 fixes a crash (GGML_ASSERT) that occurred when using MTP with MoE vision models such as Qwen3.6-35B-A3B.
