@0xSero: Minimax-M3 running on 4x RTX Pro 6000s - 800k context - 4x concurrency at 250k - 70-120 tok/s - 2000 tok/s prefill no c…
Summary
Minimax-M3 is demonstrated running on 4x RTX Pro 6000 GPUs with 800k context, achieving 70-120 tok/s inference and 2000 tok/s prefill at 4x concurrency using 376GB VRAM in mxfp4 format.
View Cached Full Text
Cached at: 06/15/26, 09:00 AM
Minimax-M3 running on 4x RTX Pro 6000s
- 800k context
- 4x concurrency at 250k
- 70-120 tok/s
- 2000 tok/s prefill no cache
- 376gb vram
- mxfp4
It’s working on improving the audio on one of my videos, it’s actually doing a good job in researching solutions.
Good model https://t.co/7QcuzrDnEK
Similar Articles
RTX 5080 16GB: Qwen3.6 35B MoE at 128k context — 56 tok/s, and why MTP doesn't help
Detailed benchmarks of Qwen3.6 35B MoE on RTX 5080 16GB show that MTP (Multi-Token Prediction) does not improve inference speed at 128k context due to VRAM constraints; the best configuration is Q4_K_XL without MTP, achieving ~56 tok/s generation at 128k context.
@0xSero: GLM-5.1-478B-NVFP4 Running on: - 4x RTX Pro 6000 - Sglang - 370,000 max tokens (1.75x full context) - p10 27.7 | p90 45…
A quantized 478B-parameter GLM-5.1 model runs on 4×RTX Pro 6000 GPUs via SGLang, delivering 370k-token context at up to 45 tok/s decode and 1340 tok/s prefill, and is demoed driving Figma.
8-16 MI50s Minimax M3 @19 tps TG (peak)
Reports a peak throughput of 19 tokens per second for the Minimax M3 model running on 8-16 MI50 GPUs.
@TeksEdge: With MiniMax M3 open source now out, here is what to expect on quants and sizes, including VRAM needed: MiniMax M3 (428…
MiniMax M3, a 428B MoE model with ~23B active parameters, is now open source. It offers ultra-long context (up to 1M) and efficiency improvements, with various quantized sizes and VRAM requirements for local deployment.
@stevibe: MiniMax M2.7 is 230B params. Can you actually run it at home? I tested Unsloth's UD-IQ3_XXS (80GB) on 4 different rigs:…
A user tested MiniMax M2.7 (230B parameter model) using Unsloth's UD-IQ3_XXS quantization (80GB) across four different hardware configurations including RTX 4090, RTX 5090, RTX PRO 6000, and DGX setups, reporting token generation speeds and time-to-first-token metrics.