@0xSero: Minimax-M3 running on 4x RTX Pro 6000s - 800k context - 4x concurrency at 250k - 70-120 tok/s - 2000 tok/s prefill no c…

X AI KOLs Following 06/14/26, 10:46 PM Models

Summary

Minimax-M3 is demonstrated running in mxfp4 on 4x RTX Pro 6000 GPUs, using 376GB of VRAM. The setup holds an 800k-token context, or 4 concurrent requests at 250k tokens, and reaches 70-120 tok/s generation and 2000 tok/s prefill without cache.

Cached at: 06/15/26, 09:00 AM

Minimax-M3 running on 4x RTX Pro 6000s

800k context
4x concurrency at 250k
70-120 tok/s
2000 tok/s prefill no cache
376gb vram
mxfp4

It’s working on improving the audio on one of my videos, it’s actually doing a good job in researching solutions.

Good model https://t.co/7QcuzrDnEK
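
To put the figures in the post in perspective, here is a rough back-of-the-envelope sketch. It assumes 96 GB of VRAM per RTX Pro 6000 and a prefill rate that stays constant at 2000 tok/s regardless of prompt length; neither assumption comes from the post, and real prefill speed usually drops as context grows. The post also does not say whether 70-120 tok/s is per request or aggregate, so the decode estimate is only indicative. The function and constant names are illustrative.

```python
# Rough estimates from the numbers quoted in the post.
GPU_COUNT = 4
VRAM_PER_GPU_GB = 96        # assumption: 96 GB per RTX Pro 6000, not stated in the post
VRAM_USED_GB = 376          # from the post
PREFILL_TOK_S = 2000        # from the post, no cache
DECODE_TOK_S = (70, 120)    # from the post, low and high end


def prefill_seconds(prompt_tokens: int, rate: float = PREFILL_TOK_S) -> float:
    """Time to process a prompt, assuming a constant prefill rate."""
    return prompt_tokens / rate


def decode_seconds(output_tokens: int, rate: float) -> float:
    """Time to generate output tokens at a given decode rate."""
    return output_tokens / rate


total_vram_gb = GPU_COUNT * VRAM_PER_GPU_GB
headroom_gb = total_vram_gb - VRAM_USED_GB

print(f"Total VRAM: {total_vram_gb} GB, headroom: {headroom_gb} GB")
print(f"Prefill 250k tokens: {prefill_seconds(250_000):.0f} s")
print(f"Prefill 800k tokens: {prefill_seconds(800_000):.0f} s")
for rate in DECODE_TOK_S:
    print(f"Decode 1000 tokens at {rate} tok/s: {decode_seconds(1000, rate):.1f} s")
```

Under these assumptions, an uncached 250k-token prompt takes about two minutes to prefill and a full 800k-token prompt takes close to seven, which is why prompt caching matters at these context lengths.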

Similar Articles

RTX 5080 16GB: Qwen3.6 35B MoE at 128k context — 56 tok/s, and why MTP doesn't help

Reddit r/LocalLLaMA

Detailed benchmarks of Qwen3.6 35B MoE on RTX 5080 16GB show that MTP (Multi-Token Prediction) does not improve inference speed at 128k context due to VRAM constraints; the best configuration is Q4_K_XL without MTP, achieving ~56 tok/s generation at 128k context.

@0xSero: GLM-5.1-478B-NVFP4 Running on: - 4x RTX Pro 6000 - Sglang - 370,000 max tokens (1.75x full context) - p10 27.7 | p90 45…

X AI KOLs Timeline

A quantized 478B-parameter GLM-5.1 model runs on 4×RTX Pro 6000 GPUs via SGLang, delivering 370k-token context at up to 45 tok/s decode and 1340 tok/s prefill, and is demoed driving Figma.

8-16 MI50s Minimax M3 @19 tps TG (peak)

Reddit r/LocalLLaMA

Reports a peak throughput of 19 tokens per second for the Minimax M3 model running on 8-16 MI50 GPUs.

@TeksEdge: With MiniMax M3 open source now out, here is what to expect on quants and sizes, including VRAM needed: MiniMax M3 (428…

X AI KOLs Following

MiniMax M3, a 428B MoE model with ~23B active parameters, is now open source. It offers ultra-long context (up to 1M) and efficiency improvements, with various quantized sizes and VRAM requirements for local deployment.

@stevibe: MiniMax M2.7 is 230B params. Can you actually run it at home? I tested Unsloth's UD-IQ3_XXS (80GB) on 4 different rigs:…