Tag
A user shares optimized settings for running Qwen3.6 27B (Q8_0) on a dual-GPU setup (RTX 4090 + RTX 3090) with llama.cpp, achieving 75-100 t/s generation and about 1500 t/s prompt processing (pp) at 250k context.
A user shares their experience of adding an old NVIDIA 2070 Super GPU to their rig for extra VRAM, which lets them run larger LLMs such as Qwen3.6-27B at high quantization and large context sizes with good performance. They are now considering an upgrade to a 3090 for even more VRAM.
A user successfully runs the Qwen3.6-35B-a3b-MTP model on a decade-old workstation with a GTX 1060 6GB using LM Studio under Windows, achieving acceptable chat speeds.
The author shares detailed tuning tips for running the Qwen3.6-35B-A3B MoE model on an 8GB RTX 3070 Ti with up to 262k context using llama.cpp, achieving 30+ t/s, and notes a 25% speed boost after switching from Windows to Ubuntu Server.
OpenClaw is seeking early users to test its inference plans for open-source models, which are sold by concurrency slot with high throughput and no shared pool. Testers get free access in exchange for feedback.
Simon Willison explores what a generation speed of 10 tokens per second means in practice for large language models, offering context on how fast that feels and what it implies for usability.
Julien C explains how to run llama.cpp with multi-token prediction (MTP) for ~2x generation speed, using either the dense 27B or the MoE 35B model, with instructions for installation and configuration.
Discussion of performance tradeoffs when using the new MTP merge in llama.cpp to run Qwen 3.6 35B on dual 3090s, with users sharing token speeds and seeking optimal configurations.
This article explores the GGUF file format used by llama.cpp for language models, highlighting its single-file convenience and the role of embedded chat templates and special tokens. It also compares different Jinja implementations and discusses what is still missing from the format.
The author speculates that loading only the active parameters of MoE models onto GPUs could drastically improve efficiency and allow large models like Kimi to run locally, though they acknowledge this is currently impractical.
K2.6 successfully downloaded and deployed the Qwen3.5-0.8B model locally on a Mac, using the niche Zig language to implement and optimize inference, demonstrating the new model’s generalization ability. After 4,000+ tool calls and 12+ hours of continuous operation, K2.6 iterated 14 times, boosting throughput from ~15 tokens/s to ~193 tokens/s, ultimately achieving 20% faster inference than LM Studio.