llama-cpp

#llama-cpp

MiniMax-M3-EAGLE3-GGUF - Llama.cpp compatible MiniMax M3 EAGLE draft model!

Reddit r/LocalLLaMA ↗ · yesterday

A GGUF conversion of MiniMax M3's EAGLE draft model for llama.cpp is now available, enabling speculative decoding speedups on compatible hardware.

Qt Creator 20 and local AI

Reddit r/LocalLLaMA ↗ · yesterday

Qt Creator 20 now supports local AI coding assistants via the Agent Client Protocol, enabling integration with open-weight models like GPT-OSS and Gemma 4 running on consumer hardware.

Gemma4-12B-QAT Uncensored Balanced is out with MTP (~60% speed boost)!

Reddit r/LocalLLaMA ↗ · yesterday

Release of Gemma4-12B-QAT Uncensored Balanced, a fine-tuned uncensored model with a multi-token-prediction draft head for ~60% faster speculative decoding, optimized for llama.cpp and offering vision support.

GLM-5.2 UD-IQ1_M on llama.cpp — 5090 + 3090 Ti speed test (~ 579 t/s prefill @ 8k ctx, ~324 t/s prefill @ 57k ctx, ~10.6 t/s decode)

Reddit r/LocalLLaMA ↗ · yesterday

Speed test results for GLM-5.2 (UD-IQ1_M quant) on llama.cpp with an RTX 5090 and an RTX 3090 Ti: prefill reaches ~579 t/s at 8k context and ~324 t/s at 57k context, with decode at ~10.6 t/s.

Qwen3.6-35B-A3B APEX on a Single RTX 3090 - Getting the Most Out of It

Reddit r/LocalLLaMA ↗ · yesterday

A detailed guide on running the Qwen3.6-35B-A3B APEX model on an RTX 3090, comparing two llama.cpp forks and quantization methods for optimal speed and quality.

Local LLM Inference Optimization: The Complete Guide

Reddit r/LocalLLaMA ↗ · 2d ago

A comprehensive guide to optimizing local LLM inference on consumer hardware, covering tools like llama.cpp, vLLM, and LM Studio, with practical advice on memory hierarchy, layer placement, and common failure modes.

@TheAhmadOsman: Why do I focus on Inference Engines/Software Stacks for your hardware? - 2x RTX 3090s: ~14.5 tok/s → ~64 tok/s moving t…

X AI KOLs Following ↗ · 2d ago

A comparison of inference engine performance on different hardware: on 2x RTX 3090s, moving from a baseline to vLLM with TP=2 raises throughput from ~14.5 tok/s to ~64 tok/s; on an RTX PRO 6000, moving to SGLang raises it from ~32 tok/s to ~110 tok/s. The author recommends vLLM or SGLang for CUDA and multi-GPU setups and llama.cpp for edge devices.

Best local model for vision - 2nd benchmark update - 21 Jun 2026

Reddit r/LocalLLaMA ↗ · 2d ago

This post presents the second update of a benchmark for local vision language models, comparing 23 models across 30 images with revised settings, and provides performance recommendations for different VRAM tiers. Key findings include that thinking mode hurts vision performance and that MoE models underperform dense models for perception tasks.

2× Radeon R9700 — Qwen 3.6 27B Q8 MTP on llama.cpp

Reddit r/LocalLLaMA ↗ · 2d ago

A technical report on running the Qwen 3.6 27B Q8 model on a dual AMD Radeon R9700 setup using llama.cpp with ROCm, including performance benchmarks and configuration details.

@analogalok: gemma-4-12B-agentic-fable5-composer2.5 V2 is out. the agentic upgrade to the model trained on Fable 5's reasoning. Runn…

X AI KOLs Timeline ↗ · 2d ago

A new fine-tuned version of Gemma 4 12B, trained on Fable 5's reasoning, achieves a significant jump in agentic coding benchmarks (from 15% to 55%) and can run locally on an 8GB VRAM GPU using a custom fork of llama.cpp.

7900XTX 24GB vram, can finally fit Q6K+MTP with Qwen 3.6 27B at 131k context

Reddit r/LocalLLaMA ↗ · 3d ago

A guide to fitting Qwen 3.6 27B at Q6_K quantization with 131k context into the 24 GB of VRAM on an AMD 7900 XTX: compile llama.cpp with OpenBLAS and the CUDA_FA_ALL_QUANTS build option, and quantize the KV cache to q5_0/q4_0.

GLM-5.2 can now run locally in llama.cpp and Unsloth Studio.

Reddit r/LocalLLaMA ↗ · 4d ago

GLM-5.2 is now supported for local execution via llama.cpp and Unsloth Studio.

llama.cpp now supports model management (downloading etc) via API

Reddit r/LocalLLaMA ↗ · 6d ago

llama.cpp now supports model management including downloading and lifecycle management via its API, allowing full deployment without external tools.

llama.cpp - how to free up even more space on your GPU

Reddit r/LocalLLaMA ↗ · 6d ago

A thread sharing practical tips for freeing up GPU memory in llama.cpp, such as offloading the mmproj to the CPU and adjusting KV cache types, with discussion of parameters like --cache-type-k, --cache-type-v, and --spec-draft-n-max.

@ItsmeAjayKV: Achievement Unlocked: Running Qwen3.6-27b dense Thanks to the RTX 3090, now I can do this. Running @Alibaba_Qwen Qwen 3…

X AI KOLs Timeline ↗ · 6d ago

A user benchmarks Qwen3.6-27B on an RTX 3090 using llama.cpp, achieving 35 tok/s generation and 1247 tok/s prompt processing.

Local models went from mostly useless to actually useful really fast. What changed?

Reddit r/LocalLLaMA ↗ · 6d ago

The post notes that local AI models have become significantly more useful over the past year, moving from toys to practical tools for coding and workflows, despite still lagging behind closed models for complex tasks.

@analogalok: my 8 GB VRAM gaming laptop is absolutely going to hate me for this. but I still did it. ran a 31b dense model (Gemma 4 …

X AI KOLs Timeline ↗ · 2026-06-16

A user runs the Gemma 4 31B dense model on a gaming laptop with 8 GB of VRAM at ~3 tokens/sec, using llama.cpp with MTP speculative decoding. The post argues that this makes a 31B dense model feasible on consumer hardware and proposes agentic workflows in which a fast MoE model routes hard tasks to the slower dense model.

Quoting Georgi Gerganov

Simon Willison's Blog ↗ · 2026-06-16

Georgi Gerganov attests that Qwen3.6-27B is a very capable local coding model, which he uses daily on his M2 Ultra or RTX 5090 with a lightweight harness.

Stop using Ollama

Reddit r/LocalLLaMA ↗ · 2026-06-15

Ollama faces criticism for failing to properly credit the llama.cpp project it depends on, violating MIT license requirements, and taking venture capital funding while drifting from its local-first mission.

Maybe dumb question, but how do you serve multiple users with the full context length?

Reddit r/LocalLLaMA ↗ · 2026-06-15

A user asks how llama.cpp can serve multiple users each with full context length, noting that it seems to only share the context pool rather than providing dedicated context per user.
