@no_stp_on_snek: @antirez Turbo3 BEATS fp8 by +5% decode tok/s at 32K context still tinkering but i've been cooking TQ+ in your kitchen

X AI KOLs Following 05/23/26, 11:14 PM Tools

turbo3 fp8 performance decode optimization quantization

Summary

Turbo3 achieves 5% faster decode tokens per second compared to fp8 at 32K context, a performance improvement in quantization or model optimization.

@antirez 🔥Turbo3 BEATS fp8 by +5% decode tok/s at 32K context still tinkering but i've been cooking TQ+ in your kitchen

Original Article

View Cached Full Text

Cached at: 05/25/26, 10:45 PM

@antirez

🔥Turbo3 BEATS fp8 by +5% decode tok/s at 32K context

still tinkering but i’ve been cooking TQ+ in your kitchen

Similar Articles

Got MTP + TurboQuant running — Qwen3.6-27B -- 80+ t/s at 262K context on a single RTX 4090

Reddit r/LocalLLaMA

Developer achieved 80+ t/s inference on Qwen3.6-27B with 262K context on a single RTX 4090 by combining MTP (Multi-Token Prediction) with TurboQuant's lossless KV cache compression, sharing their implementation fork and technical details.

@iotcoi: Qwen3.6-27B-FP8 + Dflash + DDTree, 256k context, 10 agents ~200 tokens/sec max decode 136t/s average on a single tiny G…

X AI KOLs Timeline

Quantized 27B Qwen3.6 model achieves 200 tok/s peak (136 avg) with 256k context and 10 agents on a single 49W GB10 GPU using Dflash+DDTree optimizations.

@no_stp_on_snek: https://x.com/no_stp_on_snek/status/2052833502475833384

X AI KOLs Following

An open-source stack using Qwen2.5-32B-Instruct with longctx and vllm-turboquant on a single AMD MI300X achieves competitive results (0.601-0.688) versus SubQ's closed model (0.659) on the MRCR v2 1M-context benchmark, demonstrating open-weights approaches are within striking distance.

2X tk/s (from 19.4 -> 38.1 tk/s on 1 x MI50) Playing with a hypothesis like speculative decoding.. but instead of an additional side model, exploiting that I can run multiple computations side-by-side AS IF I had Qwen3.6-27B loaded twice in memory - small quants don't use all the available compute.

Reddit r/LocalLLaMA

Packed Twin Inference (PTI) is a technique that achieves ~2× LLM throughput by running multiple token sequences in a single batch decode, exploiting weight sharing in llama.cpp without needing a draft model or additional VRAM.

@witcheer: everyone says NVFP4 makes blackwell cards "faster." I benchmarked Qwen3.6-27B three ways on my 5090: >NVFP4 >plain Q4_K…