z-lab/Qwen3.6-27B-DFlash


Summary

This article introduces Qwen3.6-27B-DFlash, the drafter model for DFlash, a speculative decoding method that uses block diffusion to accelerate inference. It provides installation instructions for vLLM and SGLang so that the drafter can run parallel drafting alongside the target model, Qwen3.6-27B.

Task: text-generation
Tags: transformers, safetensors, qwen3, feature-extraction, dflash, speculative-decoding, diffusion, efficiency, flash-decoding, qwen, diffusion-language-model, text-generation, custom_code, arxiv:2602.06036, license:mit, text-generation-inference, endpoints_compatible, region:us

Cached at: 05/08/26, 09:04 AM


Source: https://huggingface.co/z-lab/Qwen3.6-27B-DFlash

Paper | GitHub | Blog

This model is still under training, and inference engine support may not be fully available yet due to architectural changes, including causal SWA layers.

DFlash is a novel speculative decoding method that uses a lightweight block diffusion model for drafting. It enables efficient, high-quality parallel drafting that pushes the limits of inference speed.

This model is the drafter component. It must be used in conjunction with the target model Qwen/Qwen3.6-27B.
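To make the drafter/target split concrete, here is a minimal sketch of the generic draft-then-verify loop used in greedy speculative decoding. It is not the DFlash implementation: `toy_target`, `toy_drafter`, `speculative_step`, and `generate` are hypothetical names, the "models" are simple arithmetic rules, and a real engine would verify the whole drafted block in one target forward pass instead of calling the target token by token.

```python
from typing import Callable, List

Token = int
TargetFn = Callable[[List[Token]], Token]             # context -> next token
DrafterFn = Callable[[List[Token], int], List[Token]]  # context, k -> k drafted tokens


def speculative_step(prefix: List[Token], draft_block: List[Token],
                     target_next: TargetFn) -> List[Token]:
    """Return the tokens to append after verifying one drafted block.

    Drafted tokens are kept while they match the target's greedy choice.
    On the first mismatch the target's token is used instead; if the whole
    block matches, the target contributes one extra token.
    """
    accepted: List[Token] = []
    for tok in draft_block:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # correction from the target
            return accepted
        accepted.append(tok)
    accepted.append(target_next(prefix + accepted))  # extra token after a full match
    return accepted


def generate(prompt: List[Token], n_new: int, drafter: DrafterFn,
             target_next: TargetFn, block_size: int = 4) -> List[Token]:
    """Generate n_new tokens; the result matches target-only greedy decoding."""
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        block = drafter(out, block_size)
        out.extend(speculative_step(out, block, target_next))
    return out[: len(prompt) + n_new]


def toy_target(ctx: List[Token]) -> Token:
    # Stand-in for the large target model (an arbitrary deterministic rule).
    return (ctx[-1] * 3 + 1) % 7


def toy_drafter(ctx: List[Token], k: int) -> List[Token]:
    # Stand-in for the drafter: usually agrees with the target, sometimes wrong.
    cur, block = list(ctx), []
    for _ in range(k):
        nxt = (cur[-1] * 3 + 1) % 7
        if len(cur) % 5 == 0:
            nxt = (nxt + 1) % 7  # deliberate drafting error
        block.append(nxt)
        cur.append(nxt)
    return block


print(generate([1], 20, toy_drafter, toy_target))
```

The point of the sketch is that drafting errors only reduce how many tokens are accepted per step; under greedy verification they never change the final output, which is why the drafter can be much smaller than the target.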

[Figure: DFlash architecture]

Quick Start

Installation

vLLM (the installation temporarily uses the build from vLLM pull request 40898, which supports interleaved SWA and ensures correct handling of target hidden states for optimal performance):

uv pip install vllm
uv pip install -U --torch-backend=auto "vllm @ git+https://github.com/vllm-project/vllm.git@refs/pull/40898/head"

SGLang:

uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/23000/head#subdirectory=python"

Launch Server

vLLM:

vllm serve Qwen/Qwen3.6-27B \
  --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-27B-DFlash", "num_speculative_tokens": 15}' \
  --attention-backend flash_attn \
  --max-num-batched-tokens 32768
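Inline JSON inside a shell command is easy to misquote. As a hedged convenience, the sketch below builds the same vLLM command from Python, using only the flags and values shown above. The helper name `build_vllm_command` is hypothetical and not part of vLLM.

```python
import json
import shlex
from typing import Dict, List

# Same speculative decoding settings as in the vLLM command above.
SPEC_CONFIG: Dict[str, object] = {
    "method": "dflash",
    "model": "z-lab/Qwen3.6-27B-DFlash",
    "num_speculative_tokens": 15,
}


def build_vllm_command(target: str, spec_config: Dict[str, object]) -> List[str]:
    """Return the argv list for `vllm serve` with the JSON config serialized safely."""
    return [
        "vllm", "serve", target,
        "--speculative-config", json.dumps(spec_config),
        "--attention-backend", "flash_attn",
        "--max-num-batched-tokens", "32768",
    ]


command = build_vllm_command("Qwen/Qwen3.6-27B", SPEC_CONFIG)

# Shell-safe string for copy and paste; pass `command` itself to subprocess.run
# if you prefer to launch the server from Python.
print(shlex.join(command))
```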

SGLang:

# Optional: enable schedule overlapping (experimental, may not be stable)
# export SGLANG_ENABLE_SPEC_V2=1
# export SGLANG_ENABLE_DFLASH_SPEC_V2=1
# export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1

python -m sglang.launch_server \
    --model-path Qwen/Qwen3.6-27B \
    --speculative-algorithm DFLASH \
    --speculative-draft-model-path z-lab/Qwen3.6-27B-DFlash \
    --speculative-num-draft-tokens 16 \
    --tp-size 1 \
    --attention-backend fa3 \
    --mem-fraction-static 0.75 \
    --mamba-scheduler-strategy extra_buffer \
    --trust-remote-code

Usage

from openai import OpenAI

# Port 30000 is SGLang's default; vLLM serves on port 8000 unless --port is set.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3.6-27B",
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    max_tokens=4096,
    temperature=0.0
)
print(response.choices[0].message.content)
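Since no benchmark numbers are published yet, it can be useful to measure decode throughput yourself, once with the drafter enabled and once against a server launched without it. The following is a rough sketch under stated assumptions: `measure_throughput` is a hypothetical helper, it relies on the response exposing `usage.completion_tokens` as OpenAI-compatible chat completion responses do, and the wall-clock time includes prompt processing and network overhead. The demo at the bottom uses a fake response so the block runs without a server.

```python
import time
from types import SimpleNamespace
from typing import Any, Callable, Tuple


def measure_throughput(create_completion: Callable[[], Any]) -> Tuple[int, float, float]:
    """Time one request and return (completion_tokens, seconds, tokens_per_second)."""
    start = time.perf_counter()
    response = create_completion()
    elapsed = time.perf_counter() - start
    n_tokens = response.usage.completion_tokens
    rate = n_tokens / elapsed if elapsed > 0 else float("inf")
    return n_tokens, elapsed, rate


def fake_completion() -> Any:
    # Stand-in for a real request; mimics only the fields used above.
    time.sleep(0.01)
    return SimpleNamespace(usage=SimpleNamespace(completion_tokens=128))


n_tokens, seconds, rate = measure_throughput(fake_completion)
print(f"{n_tokens} tokens in {seconds:.3f}s ({rate:.1f} tok/s)")

# With a running server, wrap the real call instead, for example:
# measure_throughput(lambda: client.chat.completions.create(
#     model="Qwen/Qwen3.6-27B",
#     messages=[{"role": "user", "content": "Write a quicksort in Python."}],
#     max_tokens=4096,
#     temperature=0.0,
# ))
```

A single request is noisy, so averaging over several prompts gives a fairer comparison.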

Benchmark Results

N/A

Acknowledgements

Special thanks to David Wang for his outstanding engineering support on this project. We are also grateful to Modal, InnoMatrix, and Yotta Labs for providing the compute resources used to train this draft model.

Citation

If you find DFlash useful, please cite our work. To share feedback on DFlash or request new model support, please fill out this form: DFlash Feedback.

@article{chen2026dflash,
  title   = {{DFlash: Block Diffusion for Flash Speculative Decoding}},
  author  = {Chen, Jian and Liang, Yesheng and Liu, Zhijian},
  journal = {arXiv preprint arXiv:2602.06036},
  year    = {2026}
}
