@RedHat_AI: Qwen3-8B 现已拥有 DFlash 投机模型！在数学推理任务上首词接受率高达 82.2%，每步平均接受 3.74 个 token……

X AI KOLs Following 2026/05/10 14:16 模型

qwen3 speculative-decoding red-hat vllm hugging-face inference-optimization

摘要

Red Hat AI 发布了用于 Qwen3-8B 的 DFlash 投机模型，在数学推理任务上实现了 82.2% 的首词接受率。该模型使用 Speculators 库和 vLLM 进行训练，以优化推理速度。

Qwen3-8B 现已拥有 DFlash 投机模型！在数学推理任务上首词接受率达 82.2%，每步平均接受 3.74 个 token。基于 Speculators 库构建。训练计算资源由 @modal 赞助。https://huggingface.co/RedHatAI/Qwen3-8B-speculator.dflash…

查看原文

查看缓存全文

缓存时间: 2026/05/11 04:33

Qwen3-8B 现在拥有 DFlash 推测模型！在数学推理任务中首 token 接受率为 82.2%，平均每步接受 3.74 个 token。基于 Speculators 库构建，训练算力由 @modal 赞助。https://huggingface.co/RedHatAI/Qwen3-8B-speculator.dflash…

RedHatAI/Qwen3-8B-speculator.dflash · Hugging Face

来源: https://huggingface.co/RedHatAI/Qwen3-8B-speculator.dflash 这是用于 Qwen/Qwen3-8B (https://huggingface.co/Qwen/Qwen3-8B) 的 DFlash 推测模型。

https://huggingface.co/RedHatAI/Qwen3-8B-speculator.dflash#training-detailsTraining Details

该模型使用 Speculators (https://github.com/vllm-project/speculators) 库，在 Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered (https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered) 的子集以及 HuggingFaceH4/ultrachat_200k (https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) 的 train_sft 分割上训练。响应由 Qwen3-8B（含推理能力）重新生成。该模型的训练算力由 Modal (https://modal.com/) 赞助。

https://huggingface.co/RedHatAI/Qwen3-8B-speculator.dflash#prepare-dataPrepare data

在已安装 speculators 的虚拟环境中

python scripts/prepare_data.py
–model Qwen/Qwen3-8B –data ./regenerated_data.jsonl
–output ./output
–seq-length 8192 ``

https://huggingface.co/RedHatAI/Qwen3-8B-speculator.dflash#launch-vllmLaunch vLLM

在已安装 vllm 的（独立）虚拟环境中

CUDA_VISIBLE_DEVICES=0,1 vllm_venv/bin/python scripts/launch_vllm.py
Qwen/Qwen3-8B
–target-layer-ids 2 10 18 26 34
– –port 8000
–gpu-memory-utilization 0.9
–disable-uvicorn-access-log
–tensor-parallel-size 1
–data-parallel-size 2 ``

https://huggingface.co/RedHatAI/Qwen3-8B-speculator.dflash#launch-trainingLaunch training

必须在 vLLM 启动完成并在后台运行时执行。

在已安装 speculators 的虚拟环境中

CUDA_VISIBLE_DEVICES=2,3 torchrun
–standalone
–nproc_per_node 2
scripts/train.py
–verifier-name-or-path Qwen/Qwen3-8B
–speculator-type dflash
–num-layers 5
–data-path ./output
–vllm-endpoint http://localhost:8000/v1
–save-path ./output/checkpoints
–epochs 3
–lr 0.0006
–total-seq-len 8192
–on-missing generate
–on-generate delete
–seed 42
–log-freq 100
–draft-vocab-size 32000
–draft-arch qwen3
–target-layer-ids 2 10 18 26 34
–draft-hidden-act silu
–scheduler-type cosine
–max-anchors 3072
–prefetch-factor 2
–num-workers 8 ``

https://huggingface.co/RedHatAI/Qwen3-8B-speculator.dflash#model-specificationsModel Specifications

基础模型: Qwen/Qwen3-8B 聊天模板: Qwen/Qwen3-8B（使用 /chat/completions 端点）格式: Safetensors 许可证: Apache 2.0 验证硬件: Nvidia H100

https://huggingface.co/RedHatAI/Qwen3-8B-speculator.dflash#deploymentDeployment

从所需的 PR 安装 vLLM

pip install git+https://github.com/vllm-project/vllm.git@refs/pull/41880/head

使用推测解码部署

vllm serve Qwen/Qwen3-8B \
–tensor-parallel-size 1 \
–max-model-len 16384 \
–speculative-config ‘{
“model”: “RedHatAI/Qwen3-8B-speculator.dflash”,
“num_speculative_tokens”: 7,
“method”: “dflash”
}’ ``

https://huggingface.co/RedHatAI/Qwen3-8B-speculator.dflash#preliminary-evaluationsPreliminary Evaluations

各数据集的位置 token 接受率：（启用推理模式）

Dataset	Pos 1	Pos 2	Pos 3	Pos 4	Pos 5	Pos 6	Pos 7	Avg Length
HumanEval	79.9%	58.0%	40.3%	27.0%	17.8%	11.3%	6.8%	3.410
math_reasoning	82.2%	62.7%	46.2%	33.5%	23.4%	15.8%	9.9%	3.740
qa	68.9%	42.6%	25.0%	14.4%	8.1%	4.4%	2.3%	2.660
question	73.0%	47.6%	30.1%	18.9%	11.7%	7.1%	4.1%	2.930
rag	71.1%	44.8%	27.0%	15.7%	8.9%	4.9%	2.5%	2.750
summarization	65.5%	36.1%	19.0%	9.5%	4.7%	2.3%	1.1%	2.380
tool_call	71.3%	44.6%	25.8%	14.4%	7.8%	4.1%	2.1%	2.700
translation	63.8%	38.4%	22.1%	11.8%	6.1%	3.2%	1.5%	2.470
writing	73.2%	47.7%	30.1%	18.9%	11.8%	7.2%	4.2%	2.930

https://huggingface.co/RedHatAI/Qwen3-8B-speculator.dflash#referencesReferences

论文: DFlash: Block Diffusion for Flash Speculative Decoding (https://arxiv.org/abs/2602.06036)

@RedHat_AI: Qwen3-8B 现已拥有 DFlash 投机模型！在数学推理任务上首词接受率高达 82.2%，每步平均接受 3.74 个 token……

RedHatAI/Qwen3-8B-speculator.dflash · Hugging Face

https://huggingface.co/RedHatAI/Qwen3-8B-speculator.dflash#training-detailsTraining Details

https://huggingface.co/RedHatAI/Qwen3-8B-speculator.dflash#prepare-dataPrepare data

在已安装 speculators 的虚拟环境中

https://huggingface.co/RedHatAI/Qwen3-8B-speculator.dflash#launch-vllmLaunch vLLM

在已安装 vllm 的（独立）虚拟环境中

https://huggingface.co/RedHatAI/Qwen3-8B-speculator.dflash#launch-trainingLaunch training

在已安装 speculators 的虚拟环境中

https://huggingface.co/RedHatAI/Qwen3-8B-speculator.dflash#model-specificationsModel Specifications

https://huggingface.co/RedHatAI/Qwen3-8B-speculator.dflash#deploymentDeployment

从所需的 PR 安装 vLLM

使用推测解码部署

https://huggingface.co/RedHatAI/Qwen3-8B-speculator.dflash#preliminary-evaluationsPreliminary Evaluations

https://huggingface.co/RedHatAI/Qwen3-8B-speculator.dflash#referencesReferences

相似文章

@LottoLabs: 这是Dflash为Qwen 3.5/6系列所做的出色工作

@charles_irl：推测就是一切。在这篇博客中，我们宣布与Z Lab共同发布六款最新的DFla…

Ornith 35B 与 Qwen3.6 35B DFlash 推测模型配合效果良好

z-lab/Qwen3.6-35B-A3B-DFlash

RedHatAI/Qwen3.6-35B-A3B-NVFP4

提交意见反馈