llama-cpp

#llama-cpp

Build a local AI coding agent from scratch

Reddit r/ArtificialInteligence ↗ · 2026-06-15 Cached

A step-by-step guide to building a minimal AI coding agent that runs entirely locally using llama.cpp, GGUF models, and a custom harness, demonstrating how to set up tools and call a model to execute real tasks like creating a landing page.

@iluciddreaming: Played with local LLMs for two months. Extensively tested various open-source models using Windows 11 + llama.cpp + llama-swap. Here is my final report card: Hardware: i7-13700 + 64GB RAM + RTX 4070. The best combination currently is gemm…

X AI KOLs Timeline ↗ · 2026-06-15 Cached

After two months of local LLM testing on an i7-13700 with 64GB RAM and an RTX 4070, the author finds that gemma-4-12B-it-QAT with MTP assistance offers the best combination of speed and usability.

@zhixianio: Finished testing, feeling quite surprised, not sure if I'm using it wrong. Feel free to provide counterexamples. Here are my results: On M5 Max, pitting this community fine-tuned gemma-4-12B-coder (llama.cpp) against my daily driver Qwen3.6-35B-…

X AI KOLs Timeline ↗ · 2026-06-15 Cached

The user tested the community fine-tuned gemma-4-12B-coder against Qwen3.6-35B-A3B MoE on three programming tasks, finding that gemma performed poorly on complex stateful programs, while Qwen 35B remained robust.

@iotcoi: Microsoft just dropped FastContext-1.0: an open-source repo scout to lower your Copilot bill GGUF on HF. Run it locally…

X AI KOLs Timeline ↗ · 2026-06-15 Cached

Microsoft released FastContext-1.0, an open-source repo scout that runs locally with llama.cpp to reduce Copilot costs by scanning files and providing only relevant context to the main agent.

@TraffAlex: Best Local LLMs for Consumer GPUs — llama.cpp Guide (June 2026) What I actually run on consumer hardware right now. Eve…

X AI KOLs Timeline ↗ · 2026-06-14 Cached

A guide to the best local LLMs for consumer GPUs as of June 2026, using llama.cpp to run models like Gemma 4-12B, Qwen3.6-27B, and Nex-N2-Mini on 8-32GB VRAM, with setup and launch commands.

WIP EAGLE3 for Qwens

Reddit r/LocalLLaMA ↗ · 2026-06-13 Cached

Work-in-progress implementation of EAGLE3 speculative decoding for Qwen models in llama.cpp.

Add arch support for cohere2-MoE by michaelw9999 · Pull Request #24260 · ggml-org/llama.cpp

Reddit r/LocalLLaMA ↗ · 2026-06-13 Cached

Pull request to add architecture support for the cohere2-MoE model to llama.cpp, enabling inference of this Mixture of Experts model.

Feral v0.2.0 - open-source local AI workspace (llama.cpp + BYOK + agent runtime), now on Windows, macOS and Linux. No telemetry, no subscription, MIT/Apache-2.0

Reddit r/AI_Agents ↗ · 2026-06-12

Feral v0.2.0 is an open-source local AI workspace that runs GGUF models via llama.cpp, supports BYOK for cloud models, includes an agent runtime with sandboxed tools and a knowledge graph, and now ships on Windows, macOS, and Linux with no telemetry or subscription.

How to setup a local coding agent on macOS

Hacker News Top ↗ · 2026-06-12 Cached

A detailed tutorial on setting up a local coding agent on macOS using Gemma 4 with an MTP draft model and llama.cpp, achieving a ~24% speed improvement through speculative decoding.

PWA Support has been merged

Reddit r/LocalLLaMA ↗ · 2026-06-12

PWA support has been merged into llama.cpp, allowing the llama-server web UI to be installed as a native-like app with standalone window mode and proper icons.

@juanjucm: I'm seeing a lot of angry people lately... remember, you can always run your coding agent locally ;) llama.cpp + OpenCo…

X AI KOLs Following ↗ · 2026-06-12 Cached

Tweet reminding developers they can run coding agents locally using llama.cpp and OpenCode for fast, reliable, and private inference, demonstrating with UnslothAI's North-Mini-Code-1.0-GGUF model.

Not All MTP Assistants Are Created Equal

Reddit r/LocalLLaMA ↗ · 2026-06-12

A detailed technical exploration of MTP speculative decoding in llama.cpp with Gemma 4 models, showing that assistant model selection and quantization significantly impact speedups, and that not all 'same name' assistants perform equally.

EAGLE3 has landed in llama.cpp

Reddit r/LocalLLaMA ↗ · 2026-06-12 Cached

EAGLE3, a speculative decoding method, has been integrated into llama.cpp, enabling faster inference.

PSA: Test your "threads" argument in llama.cpp (+80% performance in my case)

Reddit r/LocalLLaMA ↗ · 2026-06-12

A user benchmarks thread counts for hybrid CPU-GPU inference with Gemma 4 in llama.cpp, finding an 80% performance uplift from using 16 threads instead of 6 on a hybrid-core CPU, and shares the optimal command configuration.

How do i prevent llama.cpp from offloading on Swap?

Reddit r/LocalLLaMA ↗ · 2026-06-11

User seeks advice on preventing llama.cpp from offloading KV cache to swap before RAM is fully exhausted, sharing their configuration on an M2 Max with 96GB RAM and a large Qwen model.

Remove padding and multiple D2D copies for MTP by gaugarg-nv · Pull Request #24086 · ggml-org/llama.cpp

Reddit r/LocalLLaMA ↗ · 2026-06-10 Cached

A pull request for llama.cpp that removes padding and multiple device-to-device copies for Multi-Token Prediction (MTP), improving GPU performance.

unsloth/diffusiongemma-26B-A4B-it-GGUF

Hugging Face Models Trending ↗ · 2026-06-10 Cached

Unsloth releases GGUF quantizations of Google DeepMind's DiffusionGemma (26B-A4B), a new block-diffusion architecture for faster text generation, ready for llama.cpp.

@leopardracer: GEMMA 4 26B ON AN RTX 4060 WITH A 248K TOKEN CONTEXT WINDOW 20 tokens per second and a context window so large you can …

X AI KOLs Timeline ↗ · 2026-06-10 Cached

Gemma 4 26B runs on an RTX 4060 with a 248K-token context at 20 tokens per second using llama.cpp and Q4_K_XL quantization, enabling local processing of entire codebases on consumer hardware.

The 'storage tax' on cloud GPUs for short LLM runs is brutal. What's your workflow?

Reddit r/AI_Agents ↗ · 2026-06-10

User seeks advice on cost-effective cloud GPU workflows for short LLM testing sessions, highlighting storage fees as a key pain point when preserving environments between runs.

unsloth/North-Mini-Code-1.0-GGUF · Hugging Face

Reddit r/LocalLLaMA ↗ · 2026-06-10 Cached

This page hosts GGUF quantized versions of Cohere's North-Mini-Code-1.0 model, a 30B-A3B MoE model optimized for code generation and agentic tasks. Instructions are provided for building llama.cpp from a specific PR to support the cohere2moe architecture.
