@no_stp_on_snek: small update from the long-context experiments: I got MRCR v2 running out to 1M on a single MI300X droplet with an open…
Summary
The author reports successful experiments running MRCR v2 with 1M context length on a single MI300X using Qwen2.5-32B and FAISS, achieving competitive scores at low cost.
Similar Articles
@no_stp_on_snek: https://x.com/no_stp_on_snek/status/2052833502475833384
An open-source stack using Qwen2.5-32B-Instruct with longctx and vllm-turboquant on a single AMD MI300X achieves competitive results (0.601-0.688) versus SubQ's closed model (0.659) on the MRCR v2 1M-context benchmark, demonstrating open-weights approaches are within striking distance.
@no_stp_on_snek: mrcr v2 8-needle at 1m, open weights stack, single rented mi300x. longctx directional 0.688 (n=30, mass-val rerun pendi…
Shares early benchmark scores and evaluation metrics for an open-weight model stack run on a single AMD MI300X, noting competitive performance against closed-source alternatives.
@0xSero: Minimax-M3 running on 4x RTX Pro 6000s - 800k context - 4x concurrency at 250k - 70-120 tok/s - 2000 tok/s prefill no c…
Minimax-M3 is demonstrated running on 4x RTX Pro 6000 GPUs with 800k context, achieving 70-120 tok/s inference and 2000 tok/s prefill at 4x concurrency using 376GB VRAM in mxfp4 format.
PM tried M3's 1M context on a real Q3 brief: where it held, where it broke
A product manager shares hands-on testing of Minimax M3's 1M context window on a real Q3 strategic brief, noting strong source attribution up to ~200K tokens but synthesis degradation beyond that.
Running Qwen3.6 35b a3b on 8gb vram and 32gb ram ~190k context
The author shares a high-performance local inference configuration for running Qwen3.6 35B A3B on limited hardware (8GB VRAM, 32GB RAM) using a modified llama.cpp with TurboQuant support, achieving ~37-51 tok/sec with ~190k context.