BeeLlama v0.2.0 – major DFlash update. Single RTX 3090: Qwen 3.6 27B up to 164 tps (4.40x), Gemma 4 31B up to 177.8 tps (4.93x). Prompt processing speed near baseline.

Reddit r/LocalLLaMA 05/22/26, 05:34 PM Tools

beellama dflash inference performance llama-cpp open-source speculative-decoding

Summary

BeeLlama v0.2.0 introduces major DFlash speculative decoding improvements, reaching up to a 4.93x speedup for Gemma 4 31B and 4.40x for Qwen 3.6 27B on a single RTX 3090, with prompt processing speed staying near baseline.

**BeeLlama v0.2.0 is here!**

>Not quite a pegasus, but close enough.

[**GitHub**](https://github.com/Anbeeld/beellama.cpp) **|** [**Qwen 3.6 27B Quick Start**](https://github.com/Anbeeld/beellama.cpp/blob/main/docs/quickstart-qwen36-dflash.md) **|** [**Gemma 4 31B Quick Start**](https://github.com/Anbeeld/beellama.cpp/blob/main/docs/quickstart-gemma-4-31b-dflash.md)

* Full Gemma 4 31B support with efficient DFlash implementation and vision.
* Major Qwen 3.6 27B performance update from lower DFlash overhead, cleaner prefill handling, drafter K/V projection caching, and safer CUDA execution.
* DFlash GGUFs with upstream architecture are now supported.
* Fixes to adaptive profit behavior around baseline probing.
* Reduced verifier path is stricter now, with safer fallback to full logits when grammar, sampler state, or reasoning requires it.
* Reasoning and tool-call boundaries were tightened.
* Stricter draft/target validation and better draft-model discovery.
* ...and many more improvements!

**Benchmarks**

* Setup: Windows 11, AMD Ryzen 7 5700X3D, 32 GB DDR4 RAM, RTX 3090 24 GB
* Config: same as in quick start docs, but with reasoning off for non-chat prompts
* Baseline and MTP server in comparison: llama.cpp [b9275](https://github.com/ggml-org/llama.cpp/releases/tag/b9275) CUDA 13.1 Windows prebuilt
* The full text of the benchmark prompts is in [README.md on GitHub](https://github.com/Anbeeld/beellama.cpp/blob/main/README.md#dflash-speedup)

**Qwen 3.6 27B**

Target model: [Qwen 3.6 27B Q5\_K\_S](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF) or [Qwen 3.6 27B MTP Q5\_K\_S](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF). DFlash model: [Q4\_K\_M](https://huggingface.co/Anbeeld/Qwen3.6-27B-DFlash-GGUF).

|Prompt|Server|Output|Median|Best|Speedup|Acceptance|
|:-|:-|:-|:-|:-|:-|:-|
|Task store module|Baseline|\~1K tok|37.2 tok/s|37.2 tok/s|1.00x|N/A|
|Task store module|DFlash|\~1K tok|**163.9 tok/s**|181.9 tok/s|**4.40x**|67.7% / 89.2%|
|Task store module|MTP|\~1K tok|69.3 tok/s|69.6 tok/s|1.86x|92.0% / 73.3%|
|KV report module|Baseline|\~1K tok|34.6 tok/s|36.5 tok/s|1.00x|N/A|
|KV report module|DFlash|\~1K tok|**157.7 tok/s**|162.5 tok/s|**4.56x**|58.8% / 88.9%|
|KV report module|MTP|\~1K tok|67.3 tok/s|68.1 tok/s|1.94x|89.3% / 73.0%|
|Doubly-linked list|Baseline|\~4K tok|36.8 tok/s|36.9 tok/s|1.00x|N/A|
|Doubly-linked list|DFlash|\~4K tok|**130.8 tok/s**|154.1 tok/s|**3.56x**|50.4% / 86.8%|
|Doubly-linked list|MTP|\~4K tok|66.3 tok/s|68.0 tok/s|1.80x|87.8% / 72.5%|
|Prompt processing|Baseline|\~20K tok|1229.5 tok/s|1229.5 tok/s|1.00x|N/A|
|Prompt processing|DFlash|\~20K tok|**1214.4 tok/s**|1221.7 tok/s|**0.99x**|N/A|
|Prompt processing|MTP|\~20K tok|1162.6 tok/s|1164.7 tok/s|0.95x|N/A|
|Multi-turn coding|Baseline|\~28K tok|33.3 tok/s|33.3 tok/s|1.00x|N/A|
|Multi-turn coding|DFlash|\~30K tok|**64.6 tok/s**|65.4 tok/s|**1.94x**|24.9% / 72.9%|
|Multi-turn coding|MTP|\~34K tok|56.5 tok/s|56.5 tok/s|1.70x|71.9% / 68.3%|

*Acceptance: accepted to proposed draft tokens / accepted draft tokens to final generated tokens*

**Gemma 4 31B**

Target model: [Gemma 4 31B Q4\_K\_S](https://huggingface.co/unsloth/gemma-4-31b-it-GGUF). DFlash model: [Q5\_K\_M](https://huggingface.co/Anbeeld/gemma-4-31B-it-DFlash-GGUF).

|Prompt|Server|Output|Median|Best|Speedup|Acceptance|
|:-|:-|:-|:-|:-|:-|:-|
|Task store module|Baseline|\~1K tok|36.1 tok/s|36.1 tok/s|1.00x|N/A|
|Task store module|DFlash|\~1K tok|**177.8 tok/s**|182.0 tok/s|**4.93x**|65.7% / 90.0%|
|KV report module|Baseline|\~1K tok|35.9 tok/s|36.0 tok/s|1.00x|N/A|
|KV report module|DFlash|\~1K tok|**154.3 tok/s**|162.8 tok/s|**4.29x**|55.7% / 88.6%|
|Doubly-linked list|Baseline|\~1.9K tok|36.0 tok/s|36.0 tok/s|1.00x|N/A|
|Doubly-linked list|DFlash|\~1.9K tok|**116.6 tok/s**|127.3 tok/s|**3.24x**|44.5% / 84.9%|
|Prompt processing|Baseline|\~24K tok|1021.3 tok/s|1021.3 tok/s|1.00x|N/A|
|Prompt processing|DFlash|\~24K tok|**954.5 tok/s**|954.9 tok/s|**0.93x**|N/A|
|Multi-turn coding|Baseline|\~12K tok|34.8 tok/s|34.8 tok/s|1.00x|N/A|
|Multi-turn coding|DFlash|\~12K tok|**60.6 tok/s**|64.1 tok/s|**1.74x**|24.4% / 72.3%|

*Acceptance: accepted to proposed draft tokens / accepted draft tokens to final generated tokens*
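
For anyone wanting to reproduce the derived columns from their own runs, here is a minimal Python sketch. It assumes the Speedup column is the server's median tok/s divided by the baseline median for the same prompt (that is how the rows appear to line up, but it is an assumption, and recomputing from the rounded table values can differ in the last digit). The function names, the per-run throughput list, and the token counts are hypothetical and are not part of BeeLlama or the published benchmark data.

```python
from statistics import median


def summarize(runs_tok_s):
    """Return (median, best) throughput across repeated runs of one prompt."""
    return median(runs_tok_s), max(runs_tok_s)


def speedup(server_tok_s, baseline_tok_s):
    """Throughput ratio of a server against the baseline on the same prompt."""
    return server_tok_s / baseline_tok_s


def acceptance(proposed, accepted, generated):
    """Return the two ratios from the Acceptance column as fractions:
    accepted / proposed draft tokens, and accepted draft tokens / final
    generated tokens."""
    return accepted / proposed, accepted / generated


# Hypothetical per-run throughputs (tok/s), for illustration only.
med, best = summarize([150.2, 157.7, 162.5])

# Median values taken from the Qwen 3.6 27B "KV report module" rows.
kv_speedup = speedup(157.7, 34.6)
print(f"{kv_speedup:.2f}x")  # prints 4.56x

# Hypothetical token counts, for illustration only.
draft_rate, output_share = acceptance(proposed=1000, accepted=677, generated=759)
print(f"{draft_rate:.1%} / {output_share:.1%}")  # prints 67.7% / 89.2%
```

The two acceptance ratios answer different questions: the first shows how often the drafter's proposals survive verification, while the second shows how much of the final output came from accepted draft tokens rather than from the target model's own tokens.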
