@no_stp_on_snek: small update from the long-context experiments: I got MRCR v2 running out to 1M on a single MI300X droplet with an open…

X AI KOLs Following 05/07/26, 03:30 PM News

long-context mrcr qwen2.5 retrieval-augmented-generation hardware-efficiency benchmarking

Summary

The author reports running MRCR v2 at context lengths up to 1M on a single MI300X using Qwen2.5-32B and FAISS, scoring 0.601 at 1M (n=60) against SubQ's published 0.659, for under $50 in credits.

small update from the long-context experiments: I got MRCR v2 running out to 1M on a single MI300X droplet with an open stack. Apache-2.0 Qwen2.5-32B + FAISS + retrieval/selector plumbing.

Current numbers:
8K: 0.822
32K: 0.697
64K chunked: 0.670
1M mass-val: 0.601 (n=60)

SubQ’s published 1M score is 0.659, so the current gap is 0.058. Cost so far: <$50 in AMD-granted DO credits. No novel attention, no custom architecture, no private model. Still working.

also numbers have no kv compression in the loop :)
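The gap quoted in the post can be checked with a few lines of arithmetic. The sketch below uses only the scores reported above. The helper name `gap_to_reference` and the relative-gap figure are illustrative additions, not something the author published.

```python
# Scores as reported in the post (MRCR v2, open stack on a single MI300X).
scores = {
    "8K": 0.822,
    "32K": 0.697,
    "64K chunked": 0.670,
    "1M mass-val": 0.601,
}
subq_1m = 0.659  # SubQ's published 1M score, as quoted in the post


def gap_to_reference(score, reference):
    """Hypothetical helper: return (absolute gap, gap as a fraction of the reference)."""
    absolute = round(reference - score, 3)  # round away floating-point noise
    relative = absolute / reference
    return absolute, relative


absolute_gap, relative_gap = gap_to_reference(scores["1M mass-val"], subq_1m)
print(f"absolute gap: {absolute_gap:.3f}")
print(f"relative gap: {relative_gap:.1%}")
```

The absolute gap matches the 0.058 stated in the post. The relative figure expresses the same gap as a share of SubQ's score.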


Similar Articles

@no_stp_on_snek: https://x.com/no_stp_on_snek/status/2052833502475833384

X AI KOLs Following

An open-source stack using Qwen2.5-32B-Instruct with longctx and vllm-turboquant on a single AMD MI300X achieves competitive results (0.601-0.688) versus SubQ's closed model (0.659) on the MRCR v2 1M-context benchmark, demonstrating open-weights approaches are within striking distance.

@no_stp_on_snek: mrcr v2 8-needle at 1m, open weights stack, single rented mi300x. longctx directional 0.688 (n=30, mass-val rerun pendi…

X AI KOLs Following

Shares early benchmark scores and evaluation metrics for an open-weight model stack run on a single AMD MI300X, noting competitive performance against closed-source alternatives.

@0xSero: Minimax-M3 running on 4x RTX Pro 6000s - 800k context - 4x concurrency at 250k - 70-120 tok/s - 2000 tok/s prefill no c…

X AI KOLs Following

Minimax-M3 is demonstrated running on 4x RTX Pro 6000 GPUs with 800k context, achieving 70-120 tok/s inference and 2000 tok/s prefill at 4x concurrency using 376GB VRAM in mxfp4 format.

PM tried M3's 1M context on a real Q3 brief: where it held, where it broke

Reddit r/AI_Agents

A product manager shares hands-on testing of Minimax M3's 1M context window on a real Q3 strategic brief, noting strong source attribution up to ~200K tokens but synthesis degradation beyond that.

Running Qwen3.6 35b a3b on 8gb vram and 32gb ram ~190k context

Reddit r/LocalLLaMA

The author shares a high-performance local inference configuration for running Qwen3.6 35B A3B on limited hardware (8GB VRAM, 32GB RAM) using a modified llama.cpp with TurboQuant support, achieving ~37-51 tok/sec with ~190k context.
