RedHatAI/Qwen3.6-35B-A3B-NVFP4

Hugging Face Models Trending

Summary

Red Hat AI has released an NVFP4-quantized build of the 35B MoE Qwen3.6, maintaining 96.28% GSM8K accuracy while enabling 4-bit inference through vLLM.

Tags: safetensors, qwen3_5_moe, qwen, nvfp4, vllm, compressed-tensors, base_model:Qwen/Qwen3.6-35B-A3B, base_model:quantized:Qwen/Qwen3.6-35B-A3B, region:us

RedHatAI/Qwen3.6-35B-A3B-NVFP4 · Hugging Face

Source: https://huggingface.co/RedHatAI/Qwen3.6-35B-A3B-NVFP4
This is a preliminary NVFP4 quantization of Qwen/Qwen3.6-35B-A3B (https://huggingface.co/Qwen/Qwen3.6-35B-A3B); it may be updated later. Both weights and activations were quantized to the NVFP4 format with vllm-project/llm-compressor (https://github.com/vllm-project/llm-compressor).
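NVFP4 represents values as 4-bit E2M1 floats with small per-block scales. As a rough numerical illustration of what that does to a tensor, here is a toy fake-quantizer; this is an illustrative sketch only, not llm-compressor's implementation, and it simplifies the real format (which stores FP8 E4M3 scales per 16-element block plus a global FP32 scale) down to FP32 block scales:

import torch

# Toy NVFP4-style fake quantization: round each 16-element block onto the
# E2M1 (FP4) grid after per-block scaling. Illustrative only -- the real
# NVFP4 format stores block scales in FP8 (E4M3) alongside a per-tensor
# FP32 scale, which this sketch simplifies to plain FP32 block scales.
_POS = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
E2M1_GRID = torch.cat([-_POS.flip(0), _POS])  # signed FP4 value grid

def nvfp4_fake_quant(x: torch.Tensor, block: int = 16) -> torch.Tensor:
    blocks = x.reshape(-1, block)
    scale = blocks.abs().amax(dim=1, keepdim=True) / 6.0  # 6.0 = max |E2M1|
    scale = torch.where(scale == 0, torch.ones_like(scale), scale)
    # nearest-neighbour rounding onto the FP4 grid, then rescale back
    idx = ((blocks / scale).unsqueeze(-1) - E2M1_GRID).abs().argmin(dim=-1)
    return (E2M1_GRID[idx] * scale).reshape(x.shape)

w = torch.randn(4096)
print(f"max abs error: {(w - nvfp4_fake_quant(w)).abs().max():.4f}")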

Compatibility has been tested against the vLLM main branch. Deployment command:
vllm serve RedHatAI/Qwen3.6-35B-A3B-NVFP4 --reasoning-parser qwen3 --moe_backend flashinfer_cutlass
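
Once the server is up, it exposes an OpenAI-compatible API. A minimal client-side smoke test might look like the following; the base_url, api_key, and sampling settings here are illustrative assumptions, not taken from the model card:

from openai import OpenAI

# Assumes the vllm serve command above is running on its default port.
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="RedHatAI/Qwen3.6-35B-A3B-NVFP4",
    messages=[{"role": "user", "content": "What is 17 * 24?"}],
    temperature=1.0,
    top_p=0.95,
)
print(resp.choices[0].message.content)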

Creation Script

Run this with the LLM Compressor main branch and the latest version of transformers.

import torch
from compressed_tensors.utils import save_mtp_tensors_to_checkpoint
from datasets import load_dataset
from transformers import AutoProcessor, Qwen3_5MoeForConditionalGeneration

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Note: this example requires transformers >= v5

MODEL_ID = "Qwen/Qwen3.6-35B-A3B"

# Load the model and processor
model = Qwen3_5MoeForConditionalGeneration.from_pretrained(MODEL_ID, dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# No need to ignore the MTP layers; they are not loaded by Qwen3_5MoeForConditionalGeneration
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=[
        "re:.*lm_head",
        "re:visual.*",
        "re:model.visual.*",
        "re:.*mlp.gate$",
        "re:.*embed_tokens$",
        "re:.*shared_expert_gate$",
        "re:.*linear_attn.*",
    ],
)

NUM_CALIBRATION_SAMPLES = 256
MAX_SEQUENCE_LENGTH = 4096

ds = load_dataset(
    "HuggingFaceH4/ultrachat_200k",
    split=f"train_sft[:{NUM_CALIBRATION_SAMPLES}]",
)
ds = ds.select_columns(["messages"])
ds = ds.shuffle(seed=42)

def preprocess_function(example):
    messages = [
        {"role": m["role"], "content": [{"type": "text", "text": m["content"]}]}
        for m in example["messages"]]
    return processor.apply_chat_template(
        messages,
        tokenize=True,
        return_dict=True,
        add_generation_prompt=False,
        processor_kwargs={
            "return_tensors": "pt",
            "padding": False,
            "truncation": True,
            "max_length": MAX_SEQUENCE_LENGTH,
            "add_special_tokens": False,
        },
    )

ds = ds.map(preprocess_function, batched=False, remove_columns=ds.column_names)

def data_collator(batch):
    assert len(batch) == 1
    return {key: torch.tensor(value) for key, value in batch[0].items()}

# Apply quantization
oneshot(
    model=model,
    recipe=recipe,
    dataset=ds,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    moe_calibrate_all_experts=True,
    data_collator=data_collator,
)

# Save to disk in compressed-tensors format
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-NVFP4"
model.save_pretrained(SAVE_DIR)
processor.save_pretrained(SAVE_DIR)

# The MTP layers are excluded from Qwen3_5MoeForConditionalGeneration,
# so copy them from the original checkpoint into the quantized output directory
save_mtp_tensors_to_checkpoint(source_model=MODEL_ID, dest_dir=SAVE_DIR)
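
As a quick check that the export loads correctly, one can point vLLM at the output directory offline. This is a sketch under the assumption that the serving flags shown earlier are not required for a basic load; adjust to your setup:

from vllm import LLM, SamplingParams

# Illustrative smoke test (not part of the original script): vLLM detects
# the compressed-tensors NVFP4 config from the directory written by
# save_pretrained above.
llm = LLM(model="Qwen3.6-35B-A3B-NVFP4")
sampling = SamplingParams(temperature=1.0, top_p=0.95, max_tokens=256)
out = llm.chat([{"role": "user", "content": "What is NVFP4?"}], sampling)
print(out[0].outputs[0].text)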

Preliminary Evaluations

  1. GSM8K Platinum:
lm_eval --model local-chat-completions \
  --tasks gsm8k_platinum_cot_llama \
  --model_args "model=RedHatAI/Qwen3.6-35B-A3B-NVFP4,max_length=262144,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=1200" \
  --num_fewshot 0 \
  --apply_chat_template \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=20,min_p=0.0,max_gen_toks=64000,presence_penalty=1.5,repetition_penalty=1.0,seed=5678"

Recovery rate:

Model                                         Accuracy
Qwen/Qwen3.6-35B-A3B                          95.62
RedHatAI/Qwen3.6-35B-A3B-NVFP4 (this model)   96.28
Recovery                                      100.69%
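
The recovery rate is simply the quantized score relative to the baseline, which is why it exceeds 100% here (the NVFP4 model happened to score higher on this benchmark). A quick check of the arithmetic:

baseline, quantized = 95.62, 96.28
print(f"recovery: {quantized / baseline * 100:.2f}%")  # -> recovery: 100.69%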

Note: more rigorous, comprehensive evaluations are underway; results will be published as soon as they are available.

Similar Articles

Qwen/Qwen3.6-27B-FP8

Hugging Face Models Trending

Alibaba has released Qwen3.6-27B-FP8, a 27B-parameter FP8-quantized model with strong results on agentic coding and reasoning benchmarks, now available on Hugging Face.

Qwen/Qwen3.6-35B-A3B-FP8

Hugging Face Models Trending

Alibaba has released Qwen3.6-35B-A3B-FP8, an open-weight quantized variant of Qwen3.6 with 35B parameters and 3B activated via MoE, featuring improved agentic coding capabilities and iterative development that preserves chain-of-thought.