@no_stp_on_snek: small update from the long-context experiments: I got MRCR v2 running out to 1M on a single MI300X droplet with an open…


Summary

The author reports running MRCR v2 out to 1M context on a single MI300X with an open stack (Qwen2.5-32B plus FAISS retrieval), scoring 0.601 at 1M (n=60) against SubQ's published 0.659, for under $50 in compute credits.

small update from the long-context experiments: I got MRCR v2 running out to 1M on a single MI300X droplet with an open stack.

Apache-2.0 Qwen2.5-32B + FAISS + retrieval/selector plumbing.

Current numbers:
8K: 0.822
32K: 0.697
64K chunked: 0.670
1M mass-val: 0.601 (n=60)

SubQ’s published 1M score is 0.659, so the current gap is 0.058.

Cost so far: <$50 in AMD-granted DO credits.

No novel attention, no custom architecture, no private model.

Still working.

also numbers have no kv compression in the loop :)
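The post names the ingredients (an open-weights model, FAISS, and retrieval/selector plumbing) but does not show the pipeline. As a rough illustration of the general chunk-retrieve-select idea only, here is a minimal stdlib sketch. It is not the author's code: the toy bag-of-words `embed`, the brute-force `top_k` search (standing in for a FAISS inner-product index), and the names `chunk_text` and `build_prompt` are all hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_words=4):
    """Split a long context into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + chunk_words])
            for i in range(0, len(words), chunk_words)]


def embed(text):
    """Toy embedding (assumption): L2-normalized bag-of-words counts.
    A real stack would use a learned embedding model here."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


def top_k(query, chunks, k=2):
    """Brute-force cosine search over all chunks.
    Stands in for a vector index such as FAISS; returns (index, score) pairs."""
    q = embed(query)
    scored = []
    for idx, chunk in enumerate(chunks):
        vec = embed(chunk)
        score = sum(w * vec.get(tok, 0.0) for tok, w in q.items())
        scored.append((score, idx))
    # Highest score first; ties broken by earlier position in the document.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [(idx, score) for score, idx in scored[:k]]


def build_prompt(query, chunks, k=2):
    """Selector step: keep the k best chunks, restored to document order,
    so the model sees a short prompt instead of the full context."""
    hits = top_k(query, chunks, k)
    selected = [chunks[idx] for idx, _ in sorted(hits)]
    return "\n".join(selected) + "\n\nQuestion: " + query


doc = ("alpha beta gamma delta the secret code is "
       "4217 today omega sigma tau upsilon phi chi psi zeta")
chunks = chunk_text(doc, chunk_words=4)
prompt = build_prompt("what is the secret code", chunks, k=1)
```

The point of the design is that the model only ever sees the selected chunks, so the effective prompt stays far below the nominal context length; how the actual stack chunks, embeds, and selects is not described in the post.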
