Tag
Benchmarks of DFlash speculative decoding combined with KV cache compression on RTX 5090 show up to 3.26x speedup on Qwen3.6-27B with minimal perplexity degradation, with q4_0/turbo4 providing the best balance.
A user reports near-linear performance scaling when adding a second RTX 3090 for inference with a Qwen model, achieving roughly 1.8x decode TPS improvement without NVLink.
Benchmarks of Multi-Token Prediction (MTP) on Gemma 4 31B and Qwen 3.6 27B using vLLM and llama.cpp show up to 3.34x faster inference, with optimal speculative token counts varying by model and engine.
Achieved 1000 tokens per second generation on Qwen3.6 27B using V100 GPUs with 128 concurrent requests, and 80 t/s for single user.