Mimo 2.5 is _fast_ at large context (dual RTX Pro 6000)
Summary
Mimo 2.5 runs fast at large context sizes on dual RTX Pro 6000 GPUs.
Similar Articles
@0xSero: Minimax-M3 running on 4x RTX Pro 6000s - 800k context - 4x concurrency at 250k - 70-120 tok/s - 2000 tok/s prefill no c…
Minimax-M3 is demonstrated running on 4x RTX Pro 6000 GPUs with an 800k context window and 4x concurrency at 250k context, reaching 70-120 tok/s inference and 2000 tok/s prefill while using 376GB of VRAM in mxfp4 format.
Mimo V 2.5 and Mimo V 2.5 Pro released.
Mimo V 2.5 and Mimo V 2.5 Pro have been released, offering updated features and improvements.
RTX 5080 16GB: Qwen3.6 35B MoE at 128k context — 56 tok/s, and why MTP doesn't help
Detailed benchmarks of Qwen3.6 35B MoE on RTX 5080 16GB show that MTP (Multi-Token Prediction) does not improve inference speed at 128k context due to VRAM constraints; the best configuration is Q4_K_XL without MTP, achieving ~56 tok/s generation at 128k context.
500k context on 48gb VRAM!! - 21tok/s (coding)
A user reports successful deployment of a quantized Nemotron-3 Super model supporting 500k context and agentic coding on consumer-grade dual Titan RTX hardware.
Qwen3.6-35B-A3B Q4 262k context on 8GB 3070 Ti = +30tps
The author shares detailed tuning tips for running the Qwen3.6-35B-A3B MoE model on an 8GB RTX 3070 Ti with up to 262k context using llama.cpp, achieving 30+ tps, and notes a 25% speed boost when switching from Windows to Ubuntu Server.