@RedHat_AI: Qwen3-8B 现已拥有 DFlash 投机模型!在数学推理任务上首词接受率高达 82.2%,每步平均接受 3.74 个 token……

X AI KOLs Following 模型

摘要

Red Hat AI 发布了用于 Qwen3-8B 的 DFlash 投机模型,在数学推理任务上实现了 82.2% 的首词接受率。该模型使用 Speculators 库和 vLLM 进行训练,以优化推理速度。

Qwen3-8B 现已拥有 DFlash 投机模型!在数学推理任务上首词接受率达 82.2%,每步平均接受 3.74 个 token。基于 Speculators 库构建。训练计算资源由 @modal 赞助。https://huggingface.co/RedHatAI/Qwen3-8B-speculator.dflash…
查看原文
查看缓存全文

缓存时间: 2026/05/11 04:33

Qwen3-8B 现在拥有 DFlash 推测模型!在数学推理任务中首 token 接受率为 82.2%,平均每步接受 3.74 个 token。基于 Speculators 库构建,训练算力由 @modal 赞助。https://huggingface.co/RedHatAI/Qwen3-8B-speculator.dflash…


RedHatAI/Qwen3-8B-speculator.dflash · Hugging Face

来源: https://huggingface.co/RedHatAI/Qwen3-8B-speculator.dflash 这是用于 Qwen/Qwen3-8B (https://huggingface.co/Qwen/Qwen3-8B) 的 DFlash 推测模型。

https://huggingface.co/RedHatAI/Qwen3-8B-speculator.dflash#training-detailsTraining Details

该模型使用 Speculators (https://github.com/vllm-project/speculators) 库,在 Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered (https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered) 的子集以及 HuggingFaceH4/ultrachat_200k (https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) 的 train_sft 分割上训练。响应由 Qwen3-8B(含推理能力)重新生成。该模型的训练算力由 Modal (https://modal.com/) 赞助。

https://huggingface.co/RedHatAI/Qwen3-8B-speculator.dflash#prepare-dataPrepare data

``

在已安装 speculators 的虚拟环境中

python scripts/prepare_data.py
–model Qwen/Qwen3-8B –data ./regenerated_data.jsonl
–output ./output
–seq-length 8192 ``

https://huggingface.co/RedHatAI/Qwen3-8B-speculator.dflash#launch-vllmLaunch vLLM

``

在已安装 vllm 的(独立)虚拟环境中

CUDA_VISIBLE_DEVICES=0,1 vllm_venv/bin/python scripts/launch_vllm.py
Qwen/Qwen3-8B
–target-layer-ids 2 10 18 26 34
– –port 8000
–gpu-memory-utilization 0.9
–disable-uvicorn-access-log
–tensor-parallel-size 1
–data-parallel-size 2 ``

https://huggingface.co/RedHatAI/Qwen3-8B-speculator.dflash#launch-trainingLaunch training

必须在 vLLM 启动完成并在后台运行时执行。

``

在已安装 speculators 的虚拟环境中

CUDA_VISIBLE_DEVICES=2,3 torchrun
–standalone
–nproc_per_node 2
scripts/train.py
–verifier-name-or-path Qwen/Qwen3-8B
–speculator-type dflash
–num-layers 5
–data-path ./output
–vllm-endpoint http://localhost:8000/v1
–save-path ./output/checkpoints
–epochs 3
–lr 0.0006
–total-seq-len 8192
–on-missing generate
–on-generate delete
–seed 42
–log-freq 100
–draft-vocab-size 32000
–draft-arch qwen3
–target-layer-ids 2 10 18 26 34
–draft-hidden-act silu
–scheduler-type cosine
–max-anchors 3072
–prefetch-factor 2
–num-workers 8 ``

https://huggingface.co/RedHatAI/Qwen3-8B-speculator.dflash#model-specificationsModel Specifications

基础模型: Qwen/Qwen3-8B 聊天模板: Qwen/Qwen3-8B(使用 /chat/completions 端点) 格式: Safetensors 许可证: Apache 2.0 验证硬件: Nvidia H100

https://huggingface.co/RedHatAI/Qwen3-8B-speculator.dflash#deploymentDeployment

``

从所需的 PR 安装 vLLM

pip install git+https://github.com/vllm-project/vllm.git@refs/pull/41880/head

使用推测解码部署

vllm serve Qwen/Qwen3-8B \
–tensor-parallel-size 1 \
–max-model-len 16384 \
–speculative-config ‘{
“model”: “RedHatAI/Qwen3-8B-speculator.dflash”,
“num_speculative_tokens”: 7,
“method”: “dflash”
}’ ``

https://huggingface.co/RedHatAI/Qwen3-8B-speculator.dflash#preliminary-evaluationsPreliminary Evaluations

各数据集的位置 token 接受率:(启用推理模式)

DatasetPos 1Pos 2Pos 3Pos 4Pos 5Pos 6Pos 7Avg Length
HumanEval79.9%58.0%40.3%27.0%17.8%11.3%6.8%3.410
math_reasoning82.2%62.7%46.2%33.5%23.4%15.8%9.9%3.740
qa68.9%42.6%25.0%14.4%8.1%4.4%2.3%2.660
question73.0%47.6%30.1%18.9%11.7%7.1%4.1%2.930
rag71.1%44.8%27.0%15.7%8.9%4.9%2.5%2.750
summarization65.5%36.1%19.0%9.5%4.7%2.3%1.1%2.380
tool_call71.3%44.6%25.8%14.4%7.8%4.1%2.1%2.700
translation63.8%38.4%22.1%11.8%6.1%3.2%1.5%2.470
writing73.2%47.7%30.1%18.9%11.8%7.2%4.2%2.930

https://huggingface.co/RedHatAI/Qwen3-8B-speculator.dflash#referencesReferences

论文: DFlash: Block Diffusion for Flash Speculative Decoding (https://arxiv.org/abs/2602.06036)

相似文章

z-lab/Qwen3.6-35B-A3B-DFlash

Hugging Face Models Trending

z-lab 发布 DFlash,一种基于轻量级块扩散模型的投机解码草稿器,可并行生成 15–16 个 token,为 Qwen3.6-35B-A3B 推理带来最高 2.9× 加速。

RedHatAI/Qwen3.6-35B-A3B-NVFP4

Hugging Face Models Trending

Red Hat AI 发布 NVFP4 量化的 35B MoE 版 Qwen3.6,在保持 96.28% GSM8K 精度的同时,通过 vLLM 实现 4-bit 推理。