Tag
A GGUF conversion of MiniMax M3's EAGLE draft model for llama.cpp is now available, enabling speculative decoding speedups on compatible hardware.
Qt Creator 20 now supports local AI coding assistants via the Agent Client Protocol, enabling integration with open-weight models like GPT-OSS and Gemma 4 running on consumer hardware.
Release of Gemma4-12B-QAT Uncensored Balanced, a fine-tuned uncensored model with a multi-token-prediction draft head for ~60% faster speculative decoding, optimized for llama.cpp and offering vision support.
Speed test results for GLM-5.2 running on llama.cpp with RTX 5090 and RTX 3090 Ti, showing prefill speeds up to 579 t/s at 8k context and decode at ~10.6 t/s.
A detailed guide on running the Qwen3.6-35B-A3B APEX model on an RTX 3090, comparing two llama.cpp forks and quantization methods for optimal speed and quality.
A comprehensive guide to optimizing local LLM inference on consumer hardware, covering tools like llama.cpp, vLLM, and LM Studio, with practical advice on memory hierarchy, layer placement, and common failure modes.
Comparison of inference engine performance on different hardware: on 2x RTX 3090s, moving from the baseline to vLLM with TP=2 raises throughput from ~14.5 tok/s to ~64 tok/s, and on an RTX PRO 6000, moving to SGLang raises it from ~32 tok/s to ~110 tok/s. Recommends vLLM or SGLang for CUDA and multi-GPU setups and llama.cpp for edge devices.
This post presents the second update of a benchmark for local vision language models, comparing 23 models across 30 images with revised settings, and provides performance recommendations for different VRAM tiers. Key findings include that thinking mode hurts vision performance and that MoE models underperform dense models for perception tasks.
Technical report on running the Qwen 3.6 27B Q8 model on a dual AMD Radeon R9700 setup using llama.cpp with ROCm, including performance benchmarks and configuration details.
A new fine-tuned version of Gemma 4 12B, trained on Fable 5's reasoning, achieves a significant jump in agentic coding benchmarks (from 15% to 55%) and can run locally on an 8GB VRAM GPU using a custom fork of llama.cpp.
A guide on optimizing VRAM usage on an AMD 7900 XTX to run a 27B Qwen model with Q6_K quantization and 131k context by compiling llama.cpp with OpenBLAS and CUDA_FA_ALL_QUANTS, and quantizing the KV cache to q5_0/q4_0.
GLM-5.2 is now supported for local execution via llama.cpp and Unsloth Studio.
llama.cpp now supports model management including downloading and lifecycle management via its API, allowing full deployment without external tools.
A thread sharing practical tips for freeing up GPU memory in llama.cpp, such as offloading mmproj to CPU and adjusting KV cache types, while discussing parameters like --cache-type-k/v and --spec-draft-n-max.
A user benchmarks Qwen3.6-27B on an RTX 3090 using llama.cpp, achieving 35 tok/s generation and 1247 tok/s prompt processing.
The post notes that local AI models have become significantly more useful over the past year, moving from toys to practical tools for coding and workflows, despite still lagging behind closed models for complex tasks.
A user runs the Gemma 4 31B dense model on a gaming laptop with 8GB of VRAM at ~3 tokens/sec using llama.cpp with MTP speculative decoding. The post demonstrates that a 31B dense model is feasible on consumer hardware and proposes agentic workflows in which a fast MoE model routes hard tasks to this slower dense model.
Georgi Gerganov attests that Qwen3.6-27B is a very capable local coding model, which he uses daily on his M2 Ultra or RTX 5090 with a lightweight harness.
Ollama faces criticism for failing to properly credit the llama.cpp project it depends on, violating MIT license requirements, and taking venture capital funding while drifting from its local-first mission.
A user asks how llama.cpp can serve multiple users each with full context length, noting that it seems to only share the context pool rather than providing dedicated context per user.
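The question in the last item can be made concrete with a little arithmetic. The sketch below assumes one total context budget divided evenly among parallel request slots. That matches the behaviour the poster describes, but it is an assumption here, not a confirmed detail of any particular llama.cpp version. The function names are hypothetical and exist only for illustration.

```python
def per_slot_context(total_ctx: int, n_slots: int) -> int:
    """Tokens available to each slot if one context budget is split evenly.

    Assumption: the server divides total_ctx equally across n_slots.
    """
    if n_slots < 1:
        raise ValueError("n_slots must be at least 1")
    return total_ctx // n_slots


def total_context_needed(per_user_ctx: int, n_users: int) -> int:
    """Total context to request so that every user gets per_user_ctx tokens.

    Under the same even-split assumption, this is simply the product.
    """
    if n_users < 1:
        raise ValueError("n_users must be at least 1")
    return per_user_ctx * n_users


if __name__ == "__main__":
    # Four users sharing a 32k budget each see a quarter of it.
    print(per_slot_context(32768, 4))
    # Giving each of the four users a full 32k means asking for 4x the total.
    print(total_context_needed(32768, 4))
```

Under this assumption, a dedicated full-length context per user means multiplying the requested total context by the number of users. KV cache memory grows with that total, so memory is usually what limits the setup.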