@no_stp_on_snek: https://x.com/no_stp_on_snek/status/2052833502475833384


Summary

An open-source stack using Qwen2.5-32B-Instruct with longctx and vllm-turboquant on a single AMD MI300X achieves competitive results (0.601-0.688) versus SubQ's closed model (0.659) on the MRCR v2 1M-context benchmark, demonstrating open-weights approaches are within striking distance.

https://t.co/3Z03DEMelO

Cached at: 05/08/26, 07:36 PM

Open Longctx vs Closed SubQ on MRCR v2 1M. Receipts.

Pre-Note: this is a benchmark post, not a vendor takedown. Respect to the SubQ team’s launch and their published 0.659 on MRCR v2 8-needle at 1M context. That is a real number on a hard benchmark, and it deserved a careful response.

TL;DR

  • MRCR v2 8-needle at 1M context. Open weights stack: Qwen2.5-32B-Instruct + longctx + vllm-turboquant on a single AMD MI300X.

  • Mass-validated (n=60), single-query selector + bge-rerank + deterministic copy: 0.601.

  • Directional (n=30), same recipe with a multi-query fan-out in front: 0.688.

  • SubQ’s published 1M: 0.659.

  • Conservative read: open within striking distance. Bullish read pending the n=80 mass-validation rerun: open crosses.

  • The pivot: the model picks one of eight retrieved candidates, Python copies the answer byte-perfect. No generation step at long context.

I wanted to know how close an open-weights stack could get on the same benchmark, on commodity hardware, with no proprietary architecture. So I ran it.

Setup

Hardware: a single AMD MI300X on a DigitalOcean dev cloud droplet, ROCm 7.2. Model: Qwen2.5-32B-Instruct, Apache 2.0, no fine-tuning. Engine: my own vllm-turboquant fork at the BF16 KV baseline, no compression on this run. Retrieval: longctx, the open-source companion service I have been working on for the last few days.

Every piece is open weights, open source, and runs on a single rented GPU.

The numbers

MRCR v2 8-needle, per-bin recall on Qwen2.5-32B-Instruct via plain RAG: 0.822 at 8K, 0.697 at 32K, 0.641 at 64K, 0.670 at 64K with 2000-character chunks, 0.440 at 1M baseline. Plain RAG is the cheapest recipe and falls off at 1M as expected.

The interesting bin is 1M, where SubQ’s published number is 0.659. Two longctx recipes here:

The conservative recipe is single-query selector with bge-reranker-v2-m3 plus a deterministic copy step. Mass-validated at n=60: 0.601. Below SubQ by 0.058 absolute. Solid number, conservatively reported.

The aggressive recipe is the same structure with a multi-query fan-out in front. Directional at n=30: 0.688. Above SubQ by 0.029 absolute. The n=80 mass-validation rerun is pending after the priority run OOM’d on the droplet before completing.

So the conservative read is: open within striking distance at 1M. The bullish read, pending mass-validation: the multi-query recipe crosses.
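The two gaps quoted above are plain differences against SubQ's published 0.659. A minimal sketch of that arithmetic, using only the scores reported in this post (the function and variable names are illustrative, not from longctx):

```python
# Scores quoted in this post for MRCR v2 8-needle at 1M context.
SUBQ_PUBLISHED = 0.659
recipes = {
    "single-query (n=60, mass-validated)": 0.601,
    "multi-query (n=30, directional)": 0.688,
}


def gap_vs_subq(score: float, reference: float = SUBQ_PUBLISHED) -> float:
    """Signed absolute gap, rounded to three decimals like the figures in the post."""
    return round(score - reference, 3)


for name, score in recipes.items():
    print(f"{name}: {score:.3f} ({gap_vs_subq(score):+.3f} vs SubQ)")
```

A negative gap means the recipe sits below SubQ's number, a positive gap means it sits above.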

The recipe

The engineering insight matters more than the headline number for anyone wanting to reproduce. The recipe is “the model selects evidence, the system extracts the answer.” That is a different shape from vanilla RAG, where the model has to generate the answer from retrieved context.

Step one is the multi-query fan-out. Take the user query and generate four paraphrase variants in a small upstream call. Retrieve the top-100 chunks per variant via cosine similarity, union the results.
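A rough sketch of the fan-out and union, in pure Python so it runs anywhere. This is not the longctx implementation: `paraphrase` and `embed` are hypothetical callables standing in for the upstream paraphrase call and the embedding model, and the post does not say whether the original query is also searched, so this sketch searches the variants only.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Vector, chunk_vecs: Dict[str, Vector], k: int) -> List[str]:
    """Return the ids of the k chunks most similar to the query vector."""
    ranked = sorted(
        chunk_vecs,
        key=lambda cid: cosine(query_vec, chunk_vecs[cid]),
        reverse=True,
    )
    return ranked[:k]


def fan_out_retrieve(
    query: str,
    paraphrase: Callable[[str], List[str]],  # hypothetical: returns query variants
    embed: Callable[[str], Vector],          # hypothetical: text -> embedding
    chunk_vecs: Dict[str, Vector],
    k: int = 100,
) -> List[str]:
    """Retrieve top-k chunk ids per variant and union them, keeping first-seen order."""
    seen: Dict[str, None] = {}
    for variant in paraphrase(query):
        for cid in top_k(embed(variant), chunk_vecs, k):
            seen.setdefault(cid, None)
    return list(seen)
```

With four variants and k=100 the union holds at most 400 chunk ids, and fewer whenever the variants retrieve overlapping chunks.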

Step two is the reranker. bge-reranker-v2-m3 narrows the union down to the top eight. This is the precision step. The reranker runs on CPU and adds about eighty seconds at top-100 union, which is the dominant latency component.
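The narrowing step can be sketched independently of any particular reranker. In the sketch below, `score_pair` is a hypothetical callable standing in for a cross-encoder such as bge-reranker-v2-m3, and `toy_overlap_score` is only a placeholder so the example runs without model weights; neither is taken from longctx.

```python
from typing import Callable, List, Sequence


def rerank(
    query: str,
    candidates: Sequence[str],
    score_pair: Callable[[str, str], float],  # hypothetical (query, passage) -> relevance
    keep: int = 8,
) -> List[str]:
    """Score every (query, candidate) pair and keep the `keep` highest-scoring texts.

    Ties are broken by original position so the output is deterministic.
    """
    scored = [(score_pair(query, text), i, text) for i, text in enumerate(candidates)]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [text for _, _, text in scored[:keep]]


def toy_overlap_score(query: str, passage: str) -> float:
    """Placeholder scorer: fraction of query words that appear in the passage."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0
```

Because every candidate in the union is scored against the query, cost grows with the size of the union, which is consistent with this step dominating latency.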

Step three is the selector. One LLM call reads the top-eight candidates and outputs a single integer index. That is the entire generation budget for the answer step. Small task, easy for the model.
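A minimal sketch of what the selector call might look like around the model. The prompt wording and the parsing rules are assumptions for illustration, not the actual longctx prompt; the LLM call itself is left out and only its reply string is handled.

```python
import re
from typing import Optional, Sequence


def build_selector_prompt(question: str, candidates: Sequence[str]) -> str:
    """Number the candidates and ask for a single integer index in reply."""
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(candidates))
    return (
        "Pick the candidate that answers the question.\n"
        f"Question: {question}\n"
        f"Candidates:\n{numbered}\n"
        f"Reply with a single integer from 0 to {len(candidates) - 1}."
    )


def parse_choice(reply: str, n_candidates: int) -> Optional[int]:
    """Extract the first integer in the reply; return None if missing or out of range."""
    match = re.search(r"-?\d+", reply)
    if match is None:
        return None
    index = int(match.group())
    return index if 0 <= index < n_candidates else None
```

Validating the index before use keeps a malformed reply from silently selecting the wrong candidate.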

Step four is the deterministic copy. Python takes the picked index and copies that candidate’s text into the answer position. Byte-perfect, no generation involved.
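The copy step is small enough to show in full. This is a sketch under one assumption the post does not cover: when the selector returns no usable index, it falls back to the top-ranked candidate (`fallback_index=0`).

```python
from typing import Optional, Sequence


def copy_answer(
    candidates: Sequence[str],
    picked_index: Optional[int],
    fallback_index: int = 0,  # assumption: fall back to the reranker's top candidate
) -> str:
    """Return the selected candidate's text unchanged; no model generation involved."""
    if picked_index is None or not 0 <= picked_index < len(candidates):
        picked_index = fallback_index
    return candidates[picked_index]
```

The returned string is the retrieved text itself, so any whitespace, punctuation, or casing in the source chunk survives exactly.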

The model never has to produce the verbatim answer at 1M context. The verbatim part is Python. The model’s job is just picking which candidate is right.

That is the architectural pivot. Once you stop asking the LLM to do the copy, retrieval-anchored long context becomes a different problem.

Why MRCR v2 matters

The 8-needle variant means the haystack contains eight similar messages on the same topic, one of which is the target, and you have to retrieve the specific one identified by an ordinal cue (“the third assistant message about asyncio cancellation”). Retrieval alone gets you near the right cluster. The work is filtering within the cluster.

That is why the selector step matters. The model is not generating the answer from long context. It is making a small selection from a small set of finalists that retrieval already narrowed down to.
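This post does not spell out how MRCR v2 responses are scored. Assuming a string-similarity metric in the style of the public OpenAI MRCR dataset, which as far as I know grades with a `difflib.SequenceMatcher` ratio, the sketch below shows why a byte-perfect copy of the right candidate beats a regenerated answer. The example strings are invented for illustration.

```python
from difflib import SequenceMatcher


def similarity(response: str, reference: str) -> float:
    """String similarity in [0, 1]; 1.0 only when the two strings are identical."""
    return SequenceMatcher(None, response, reference).ratio()


reference = "Cancelling a task raises CancelledError inside the awaited coroutine."
exact_copy = reference  # what a deterministic copy of the right candidate yields
regenerated = "When you cancel a task, the coroutine it awaits gets a CancelledError."

print(f"exact copy:  {similarity(exact_copy, reference):.3f}")
print(f"regenerated: {similarity(regenerated, reference):.3f}")
```

Under that assumption, picking the right candidate scores a full 1.0 on the item, while a faithful but reworded answer loses points for every character that differs.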

Caveats

Honest list:

  • The n=30 multi-query directional 0.688 is real but small-n; the headline number could shift on the n=80 mass-validation rerun.

  • The single-query mass-validation at 0.601 sits below SubQ at 0.659; that gap is real on the conservative recipe.

  • There is no head-to-head with SubQ directly because they have a closed API; this is benchmark-versus-benchmark, not model-versus-model.

  • MRCR v2 8-needle is one specific benchmark; it does not measure every long-context capability SubQ is selling on.

What is different from SubQ

Different problem shape. SubQ solves “fit 12M tokens of attention” with a closed proprietary architecture, presumably a content-dependent selector layered on a custom transformer, served on B200 clusters. Longctx solves “find the relevant tokens, hand the model eight candidates, have the model pick one” with an open inference engine, open retrieval components, and open model weights.

Both work. The workloads they cover overlap heavily but are not identical. On MRCR v2 8-needle specifically, the open path is within striking distance, and possibly past on the multi-query recipe pending the rerun.

The open-versus-closed framing is honest. I am not claiming the same architecture. I am claiming the open path catches up on the benchmark SubQ used for their published number.

Closing

The n=80 mass-validation rerun is the next milestone. If it crosses 0.659, I will post the update. If it lands below, the open path is at “within striking distance” rather than “past,” which is also a real result on a benchmark a lot of money has been spent on.

Respect to SubQ. Their work raised the bar for the long-context conversation. This is the open-stack receipt for the same benchmark. Looking forward to comparing notes when the rerun lands.

Links

Paper: https://github.com/TheTom/turboquant_plus/blob/main/docs/papers/longctx-1m-and-triattention.md

Longctx (open source): https://github.com/TheTom/longctx

vllm-turboquant (the engine): https://github.com/TheTom/vllm-turboquant

Per-bin curve and raw logs: https://github.com/TheTom/longctx/blob/main/docs/results.md
