Benchmarks of DFlash speculative decoding combined with KV cache compression on RTX 5090 show up to 3.26x speedup on Qwen3.6-27B with minimal perplexity degradation, with q4_0/turbo4 providing the best balance.
**Hardware:** RTX 5090 | **Model:** Qwen3.6-27B | **Framework:** BeeLlama.cpp

Full benchmark scripts, raw data, config, and generated artifacts are available on request: just DM or comment below.

---

I spent the last week benchmarking [DFlash speculative decoding](https://arxiv.org/abs/2602.06036) combined with KV cache compression strategies on Qwen3.6-27B. The results are surprising enough that I wanted to share them for anyone running local inference.

## Setup

- **GPU:** NVIDIA RTX 5090 (32GB VRAM)
- **Model:** Qwen3.6-27B in two quantizations: UD-Q5_K_XL and NVFP4-Q8_0
- **Drafter:** Qwen3.6-27B-DFlash-Q5_K_M
- **Framework:** [BeeLlama.cpp](https://github.com/Anbeeld/beellama.cpp/) (DFlash + TurboQuant/TCQ support)
- **PPL dataset:** WikiText-2
- **Throughput:** Custom coding prompts (code generation tasks)

## TL;DR

| Strategy | Speedup | PPL Δ | Code Quality |
|----------|---------|-------|--------------|
| **q4_0/turbo4** ⭐ | **3.18x** | **+0.02%** | 3.0/3.0 HTML |
| turbo4/turbo4 | 3.26x | +0.04% | Tested |
| turbo2_tcq/turbo2_tcq | 3.26x | +0.76% | Slight drop |
| Baseline (no KV compression) | 2.92x | N/A | 2.33/3.0 |

**`q4_0/turbo4` is the sweet spot:** 3.18x speedup with +0.02% PPL degradation, statistically indistinguishable from the K_Q8_VQ5_1 baseline.

---

## 1. Q5_K_XL vs NVFP4-Q8_0: Which Quantization Wins?

Q5_K_XL dominates NVFP4-Q8_0 across every metric when DFlash is enabled:

| Quant | Baseline tok/s | Best tok/s | Max Speedup |
|-------|----------------|------------|-------------|
| **Q5_K_XL** | **176.5** | **195.2** | **3.26x** |
| NVFP4-Q8_0 | 157.2 | 152.6 | 2.83x |

Q5_K_XL is faster at baseline AND scales better with KV compression strategies.

## 2. Perplexity: KV Compression Quality

Measured on WikiText-2 (lower is better).
K_Q8_VQ5_1 baseline: **PPL = 1.8046 ± 0.00295**

| KV Strategy | PPL | Δ vs K_Q8_VQ5_1 |
|-------------|-----|-----------|
| **q4_0/turbo4** | **1.8050** | **+0.02%** |
| turbo4/turbo4 | 1.8053 | +0.04% |
| turbo4/turbo2_tcq | 1.8100 | +0.30% |
| turbo4/tcq | 1.8132 | +0.48% |
| turbo2_tcq/turbo2_tcq | 1.8184 | +0.76% |

The `q4_0/turbo4` strategy is within 1 standard deviation of the K_Q8_VQ5_1 baseline.

**Reproduction:**

```bash
python -m tests.benchmark_kv_cache --model Qwen3.6-27B-UD-Q5_K_XL-kv_q4_0_turbo4-dflash-256k
```

## 3. Drafter Model: Confirming the Anbeeld Claim

My results confirm ~3x speedup with a small drafter model as stated by Anbeeld:

- **Drafter:** Qwen3.6-27B-DFlash-Q5_K_M (same architecture, smaller quant)
- **Acceptance rate:** 30-51% depending on KV strategy
- **Speedup range:** 2.58x to 3.26x

The drafter is efficient because DFlash uses a cross-attention mechanism (not token-by-token speculation), so even a smaller drafter can propose useful token sequences.

## 4. Compression Strategy Deep Dive

### Strategy recommendations

| Goal | Strategy | Trade-off |
|------|----------|-----------|
| Best balance | `q4_0/turbo4` | 3.18x, +0.02% PPL |
| Maximum speed | `turbo4/turbo4` or `turbo2_tcq/turbo2_tcq` | 3.26x, +0.04-0.76% PPL |
| Maximum quality | `q8_0/q5_1` | Baseline, memory hungry |

## 5. Code Quality: Does Compression Break Generation?

Benchmarked by generating a Tetris game (CLI Python + single-file HTML), 3 iterations each, scored 0-3 by functional completeness:

| Config | CLI | HTML |
|--------|-----|------|
| **Q5_K_XL + q4_0/turbo4** | **2.33/3.0** | **3.0/3.0** |
| Q5_K_XL baseline | 2.0/3.0 | 2.33/3.0 |
| Q5_K_XL + turbo2_tcq | 2.0/3.0 | 2.0/3.0 |
| NVFP4-Q8_0 + turbo2_tcq | 2.25/3.0 | 1.67/3.0 |
| NVFP4-Q8_0 baseline | 1.67/3.0 | 1.33/3.0 |

KV compression with `q4_0/turbo4` actually improved code quality over the baseline (3.0/3.0 HTML vs 2.33/3.0). Generated code from all iterations is available on request.
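If you want to sanity-check the perplexity numbers from section 2 without running anything, here is a minimal Python sketch that recomputes the Δ column and the "within the baseline's ±" check. The PPL values are copied from the table above. The helper names (`ppl_delta_percent`, `within_baseline_error`) are made up for this illustration and are not part of my benchmark scripts, and I'm treating the reported ± as a plain tolerance around the baseline, which is an assumption about how to read it.

```python
# PPL values copied from the section 2 table (WikiText-2, lower is better).
BASELINE_PPL = 1.8046   # K_Q8_VQ5_1 baseline
BASELINE_ERR = 0.00295  # reported +/- on the baseline (treated as a tolerance; assumption)

PPL_BY_STRATEGY = {
    "q4_0/turbo4": 1.8050,
    "turbo4/turbo4": 1.8053,
    "turbo4/turbo2_tcq": 1.8100,
    "turbo4/tcq": 1.8132,
    "turbo2_tcq/turbo2_tcq": 1.8184,
}


def ppl_delta_percent(ppl, baseline=BASELINE_PPL):
    """Relative PPL change versus the baseline, in percent."""
    return (ppl - baseline) / baseline * 100.0


def within_baseline_error(ppl, baseline=BASELINE_PPL, err=BASELINE_ERR):
    """True if the PPL sits inside the baseline's reported +/- band."""
    return abs(ppl - baseline) <= err


if __name__ == "__main__":
    for name, ppl in PPL_BY_STRATEGY.items():
        delta = ppl_delta_percent(ppl)
        inside = within_baseline_error(ppl)
        print(f"{name:24s} {delta:+.2f}%  within_baseline_error={inside}")
```

Rounded to two decimals, the deltas match the table. This is only arithmetic on the published numbers, not a significance test.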
## Reproduction Commands

```bash
# Perplexity (WikiText-2)
python -m tests.benchmark_kv_cache --model <model_key>

# Throughput (coding tasks)
python -m tests.benchmark_dflash --model <model_key>

# Code quality (Tetris generation)
python -m tests.benchmark_tetris --model <model_key>
```

Model keys are defined in `config.yaml`. If you're interested in the actual scripts, config, charts, or the full comprehensive report, reach out via DM or comment and I'll send everything over.

## Reproducibility

I'm working on a public GitHub repo with all the necessary resources for full reproducibility (benchmark scripts, config, raw data, generated code, and charts). Currently cleaning it up and anonymizing paths. In the meantime, anything mentioned in this post is available on request, just ask.

## Links

- **BeeLlama.cpp:** [https://github.com/Anbeeld/beellama.cpp/](https://github.com/Anbeeld/beellama.cpp/)
- **DFlash Paper:** [https://arxiv.org/abs/2602.06036](https://arxiv.org/abs/2602.06036)

Edit: Corrected references: FP16 changed to K_Q8_VQ5_1 (the KV cache compression I'm using as baseline); BeeLlama GitHub link; DFlash paper reference.
A new KV cache optimization called kvflash doubles generation speed and reduces VRAM usage for Qwen 3.6-27B on a single RTX 3090 while maintaining accuracy.
A benchmark of NVFP4 on an RTX 5090 with Qwen3.6-27B shows prefill speed gains of 32-42% over equal-bit Q4_K_M and 52-68% over Q6_K, but decode gains are modest (+9% vs Q4) as decode is memory-bandwidth bound. The quality loss compared to Q6 is minimal (-0.8 average), making NVFP4 a good choice for local inference.
Z Lab, SGLang, and Modal release DFlash, a new speculative decoding model for Qwen 3.5 397B-A17B that uses block diffusion and KV injection to achieve over 4x throughput improvement over baseline and 1.5x over native MTP.
A benchmark shows that using vLLM with DFlash speculative decoding boosts Gemma 4 26B inference to ~578 tokens per second on a single RTX 5090, achieving a 2.56x speedup over baseline.
A detailed account of running the Qwen3.6-35B-A3B MoE model on an 8GB laptop GPU, covering effective optimizations like --no-mmap and VRAM headroom, unexpected findings where speculative decoding improved speed by 26% contrary to benchmarks, and pitfalls with Windows and CPU bottlenecks.