Tag
A step-by-step guide to building a minimal AI coding agent that runs entirely locally using llama.cpp, GGUF models, and a custom harness, demonstrating how to set up tools and call a model to execute real tasks like creating a landing page.
After two months of local LLM testing on an i7-13700 with 64GB RAM and an RTX 4070, the author finds that gemma-4-12B-it-QAT combined with MTP assistance performs best for speed and usability.
A user tests the community fine-tuned gemma-4-12B-coder against the Qwen3.6-35B-A3B MoE on three programming tasks, finding that the Gemma fine-tune performs poorly on complex stateful programs while the Qwen model remains robust.
Microsoft released FastContext-1.0, an open-source repo scout that runs locally with llama.cpp to reduce Copilot costs by scanning files and providing only relevant context to the main agent.
A guide to the best local LLMs for consumer GPUs as of June 2026, using llama.cpp to run models like Gemma 4-12B, Qwen3.6-27B, and Nex-N2-Mini on 8-32GB VRAM, with setup and launch commands.
Work-in-progress implementation of EAGLE3 speculative decoding for Qwen models in llama.cpp.
Pull request to add architecture support for the cohere2-MoE model to llama.cpp, enabling inference of this Mixture of Experts model.
Feral v0.2.0 is an open-source local AI workspace that runs GGUF models via llama.cpp, supports BYOK for cloud models, includes an agent runtime with sandboxed tools and a knowledge graph, and now ships on Windows, macOS, and Linux with no telemetry or subscription.
A detailed tutorial on setting up a local coding agent on macOS using Gemma 4 with MTP draft model and llama.cpp, achieving ~24% speed improvement through speculative decoding.
PWA support has been merged into llama.cpp, allowing the llama-server web UI to be installed as a native-like app with standalone window mode and proper icons.
Tweet reminding developers they can run coding agents locally using llama.cpp and OpenCode for fast, reliable, and private inference, demonstrating with UnslothAI's North-Mini-Code-1.0-GGUF model.
A detailed technical exploration of MTP speculative decoding in llama.cpp with Gemma 4 models, showing that assistant model selection and quantization significantly impact speedups, and that not all 'same name' assistants perform equally.
EAGLE3, a speculative decoding method, has been integrated into llama.cpp, enabling faster inference.
A user benchmarks thread count for hybrid CPU-GPU inference with Gemma 4 in llama.cpp, discovering an 80% performance uplift from using 16 threads instead of 6 on a hybrid-core CPU, and shares the optimal command configuration.
User seeks advice on preventing llama.cpp from offloading KV cache to swap before RAM is fully exhausted, sharing their configuration on an M2 Max with 96GB RAM and a large Qwen model.
A pull request for llama.cpp that removes padding and multiple device-to-device copies for Multi-Token Prediction (MTP), improving performance on GPU.
Unsloth releases GGUF quantizations of Google DeepMind's DiffusionGemma (26B-A4B), a new block-diffusion architecture for faster text generation, ready for llama.cpp.
Gemma 4 26B runs on an RTX 4060 with 248K token context at 20 tokens per second using llama.cpp and Q4_K_XL quantization, enabling local processing of entire codebases on consumer hardware.
User seeks advice on cost-effective cloud GPU workflows for short LLM testing sessions, highlighting storage fees as a key pain point when preserving environments between runs.
This page hosts GGUF quantized versions of Cohere's North-Mini-Code-1.0 model, a 30B-A3B MoE model optimized for code generation and agentic tasks. Instructions are provided for building llama.cpp from a specific PR to support the cohere2moe architecture.