@no_stp_on_snek: @antirez Turbo3 BEATS fp8 by +5% decode tok/s at 32K context still tinkering but i've been cooking TQ+ in your kitchen

Summary

According to the post, Turbo3 decodes about 5% more tokens per second than fp8 at a 32K context. The author says the work is still being tuned.

Cached at: 05/25/26, 10:45 PM

@antirez

🔥Turbo3 BEATS fp8 by +5% decode tok/s at 32K context

still tinkering but i’ve been cooking TQ+ in your kitchen

Similar Articles

@no_stp_on_snek: https://x.com/no_stp_on_snek/status/2052833502475833384

X AI KOLs Following

An open-source stack using Qwen2.5-32B-Instruct with longctx and vllm-turboquant on a single AMD MI300X achieves competitive results (0.601-0.688) versus SubQ's closed model (0.659) on the MRCR v2 1M-context benchmark, demonstrating open-weights approaches are within striking distance.

2X tk/s (from 19.4 -> 38.1 tk/s on 1 x MI50) Playing with a hypothesis like speculative decoding.. but instead of an additional side model, exploiting that I can run multiple computations side-by-side AS IF I had Qwen3.6-27B loaded twice in memory - small quants don't use all the available compute.

Reddit r/LocalLLaMA

Packed Twin Inference (PTI) is a technique that achieves ~2× LLM throughput by running multiple token sequences in a single batch decode, exploiting weight sharing in llama.cpp without needing a draft model or additional VRAM.
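
The weight-sharing idea in the PTI summary can be illustrated with a toy sketch. This is not the PTI implementation or llama.cpp code; `matvec` and `batched_decode_step` are hypothetical names, and the tiny matrix stands in for a model layer. It only shows that one pass over the weights can serve several sequences and give the same results as running each sequence separately.

```python
def matvec(weights, x):
    """Multiply a weight matrix (list of rows) by one activation vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def batched_decode_step(weights, batch):
    """Apply the same weights to every sequence's activation vector.

    Each weight row is visited once and reused for all sequences in the
    batch, which is the sharing that batched decoding relies on.
    """
    outputs = [[] for _ in batch]
    for row in weights:  # single pass over the weights
        for i, x in enumerate(batch):
            outputs[i].append(sum(w * v for w, v in zip(row, x)))
    return outputs


# Toy 2x2 "layer" and two independent sequences (illustrative values only).
weights = [[1.0, 2.0], [3.0, 4.0]]
seq_a = [1.0, 0.0]
seq_b = [0.0, 1.0]

batched = batched_decode_step(weights, [seq_a, seq_b])
separate = [matvec(weights, seq_a), matvec(weights, seq_b)]
```

On real hardware the gain comes from decode being limited by reading weights from memory, so reusing each weight for two sequences costs little extra; the sketch shows only the equivalence of results, not the speedup.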