model-inference

Best Settings for 48GB VRAM + Qwen 3.6 27B

Reddit r/LocalLLaMA · yesterday

A user shares optimized settings for running Qwen3.6 27B (Q8_0) on a dual-GPU setup (RTX 4090 + RTX 3090) with llama.cpp, achieving 75-100 t/s generation and about 1500 t/s prompt processing (pp) with a 250k context.

Added an old 2070 Super to my rig and I can't go back...worse, now I need more

Reddit r/LocalLLaMA · 2026-05-31

A user describes adding an old NVIDIA 2070 Super to their rig for extra VRAM, which lets them run larger LLMs such as Qwen3.6-27B at high quantization and context size with good performance. They are now considering an upgrade to a 3090 for even more VRAM.

qwen3.6-35b-a3b-mtp running on GTX 1060 6GB

Reddit r/LocalLLaMA · 2026-05-24

A user runs the Qwen3.6-35B-A3B-MTP model on a decade-old workstation with a GTX 1060 6GB using LM Studio under Windows, achieving acceptable chat speeds.

Qwen3.6-35B-A3B Q4 262k context on 8GB 3070 Ti = +30tps

Reddit r/LocalLLaMA · 2026-05-22

The author shares detailed tuning tips for running the Qwen3.6-35B-A3B MoE model on an 8GB RTX 3070 Ti with up to 262k context using llama.cpp, achieving 30+ tps, and notes a 25% speed boost when switching from Windows to Ubuntu Server.

Looking for early users to try our OpenClaw model plans and tell us what's broken (15–30 min)

Reddit r/openclaw · 2026-05-20

OpenClaw is seeking early users to test their open-source model inference plans, sold by concurrency slot with high throughput and no shared pool, in exchange for free access and feedback.

How fast is 10 tokens per second really?

Simon Willison's Blog · 2026-05-20

Simon Willison explores what a generation speed of 10 tokens per second means in practice for large language models: how fast it feels and what that implies for usability.

@julien_c: I've seen some confusion online on how to run llama.cpp with MTP (Multi-token prediction) in the simplest way possible.…

X AI KOLs Following · 2026-05-19

Julien C explains how to run llama.cpp with multi-token prediction (MTP) for ~2x generation speed, using either the dense 27B or the MoE 35B model, with instructions for installation and configuration.

Now that MTP is merged... What's the best outputs you're getting on Qwen 3.6 35B on 2x3090s?

Reddit r/LocalLLaMA · 2026-05-16

Discussion of performance tradeoffs when using the new MTP merge in llama.cpp to run Qwen 3.6 35B on dual 3090s, with users sharing token speeds and seeking optimal configurations.

What's in a GGUF, besides the weights – and what's still missing?

Hacker News Top · 2026-05-14

This article explores the GGUF file format used by llama.cpp for language models, highlighting its single-file convenience and the role of embedded chat templates and special tokens. It also compares different Jinja implementations and discusses what is still missing from the format.

@jun_song: If we ever figure out how to load ONLY the active params of an MoE into the GPU instead of the full weights, it's game …

X AI KOLs Following · 2026-05-10

The author speculates that loading only the active parameters of an MoE model onto the GPU could drastically improve efficiency and allow large models like Kimi to run locally, though they acknowledge this is currently impractical.

@sanbuphy: K2.6 successfully downloaded and deployed the Qwen3.5-0.8B model locally on a Mac, using the niche Zig language to implement and optimize inference, demonstrating the new model’s generalization ability. After 4,000+ tool calls and 12+ hours of continuous operation, K2.6 iterated 14 times…

X AI KOLs Timeline · 2026-04-21

K2.6 successfully downloaded and deployed the Qwen3.5-0.8B model locally on a Mac, using the niche Zig language to implement and optimize inference, demonstrating the new model’s generalization ability. After 4,000+ tool calls and 12+ hours of continuous operation, K2.6 iterated 14 times, boosting throughput from ~15 tokens/s to ~193 tokens/s, ultimately achieving 20% faster inference than LM Studio.
