XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash
Summary
XiaomiMiMo releases MiMo-V2.5-Pro-FP4-DFlash, an FP4-quantized MoE model with block-diffusion speculative decoding to reduce memory and bandwidth for trillion-parameter inference.
View Cached Full Text
Cached at: 06/12/26, 02:53 AM
XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash · Hugging Face
Source: https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash

MiMo-V2.5-Pro-FP4-DFlash is the underlying model that powers MiMo-V2.5-Pro-UltraSpeed:
- AnFP4-quantized backbonethat applies MXFP4 quantization to the MoE experts while keeping the rest of the model at higher precision, shrinking model size and memory-bandwidth pressure with near-lossless quality.
- ABF16 DFlash drafterfor block-diffusion speculative decoding, which proposes a whole block of tokens per forward pass and lets the backbone verify them in one step.
Together they cut both the per-parameter bit width and the number of backbone forward passes, the two dominant costs of trillion-parameter decoding.
https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash#1-introduction1. Introduction
At the trillion-parameter (1T) scale, even 8-bit (FP8/INT8) inference carries severe memory-footprint and memory-bandwidth costs. Lowering the parameter bit width translates directly into faster decoding. We therefore adopt FP4 quantization and block-diffusion speculative decoding. Key features of this release:
- Expert-Only FP4 Quantization: A blanket FP4 cast over the whole model tends to degrade accuracy and generalization on complex reasoning and code. Given MiMo-V2.5-Pro’s MoE architecture where experts hold the vast majority of parameters and tolerate quantization best, we quantize only the MoE experts to FP4 (MXFP4) and keep the other modules at their original precision. Through FP4 QAT, the model retains near-lossless capability while substantially reducing size and saturating hardware bandwidth.
- DFlash Speculative Decoding: A lightweight block-diffusion drafter fills an entire block of masked positions in a single forward pass, removing the serial draft autoregression bottleneck of conventional speculative decoding while the backbone’s verification preserves output quality.
https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash#2-fp4-quantization2. FP4 Quantization
We quantize only the MoE experts to MXFP4 (block size 32) and keep attention projections and other modules at higher precision (the attentiono\_projof every layer is excluded from FP4). With FP4 QAT, quality stays close to the FP8 baseline:

BenchmarkMiMo-V2.5-Pro-FP8MiMo-V2.5-Pro-MXFP4ΔGeneral AgentClaw-Eval (pass^3)63.867.8+6.27%Humanity’s Last Exam48.047.0-2.08%Humanity’s Last Exam (without tool)34.033.0-2.94%Code AgentSWE-Bench Pro57.258.8+2.80%SWE-bench Verified78.977.4-1.90%
https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash#3-block-diffusion-speculative-decoding-dflash3. Block-Diffusion Speculative Decoding (DFlash)
Conventional speculative decoding relies on a small draft model to guess the next tokens, which the large model then verifies; the rejection-sampling verification keeps the output lossless. Its bottleneck is that draft quality bounds the acceptance rate, while a stronger draft costs more compute.
To break this trade-off we adopt the block-level masked parallel-prediction approach DFlash: the draft fills an entire block of masked positions in one forward pass. We landed this on MiMo-V2.5-Pro with custom optimizations for trillion-scale MoE and long-context serving, using the Muon second-order optimizer and model self-distillation so that even a small mask block keeps a strong acceptance rate while pushing the draft-stage cost close to its limit:
- The drafter uses Sliding Window Attention (SWA) throughout, naturally aligned with the SWA design of the MiMo-V2 series. The draft no longer depends on the full prefix, so the per-prediction compute moves from linear-in-context-length to constant.
- During training the mask signal is sampled on the local GPU shard, so a single sequence yields tens of thousands of independent training signals covering positions at different context lengths in one step, aligning with the MiMo-V2 series’ long-context capability while avoiding cross-device communication overhead.
In practice, we further cap the mask block size at 8 to lower verification overhead and raise concurrency.
ScenarioAcceptance LengthWebDev6.30Math5005.56HumanEval4.54MT-Bench3.18SWE-Bench4.29
https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash#4-model-summary4. Model Summary
ComponentBackboneDFlash DrafterArchitectureMiMoV2ForCausalLMDFlashDraftModelTotal / Active Params1.02T / 42B5-layer draftHidden Size61446144Num Layers705Num Attention Heads128128Num KV Heads8 (GQA)8 (GQA)Head Dim (QK / V)192 / 128128 / 128SWA Window Size1281024Block Size—8Captured Backbone Layers—[0, 15, 31, 47, 69]Backbone RoPE Base5,000,0005,000,000PrecisionMXFP4 (experts) MixedBF16Max Context Length1M—
https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash#5-deployment5. Deployment
DFlash inference with the FP4 backbone is supported in SGLang. The drafter is launched alongside the backbone via the speculative-decoding flags and inherits the backbone’s tensor/expert-parallel topology.
https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash#sglang-deploymentSGLang Deployment
The following is an example of running the model with SGLang. Point\-\-modelat this repository and\-\-speculative\-draft\-model\-pathat itsdflash/subdirectory.
python3 -m sglang.launch_server \
--model MiMo-V2.5-Pro-FP4-DFlash \
--speculative-algorithm DFLASH \
--speculative-draft-model-path MiMo-V2.5-Pro-FP4-DFlash/dflash \
--speculative-num-draft-tokens 8 \
--ep-size 16 \
--tensor-parallel-size 16 \
--data-parallel-size 2 \
--enable-dp-attention \
--enable-dp-lm-head \
--quantization fp8 \
--attention-backend fa3 \
--moe-dense-tp-size 1 \
--dtype bfloat16 \
--mem-fraction-static 0.65 \
--context-length 65536 \
--page-size 1 \
--trust-remote-code \
--disable-overlap-schedule \
--skip-server-warmup \
--dist-init-addr ${MASTER_ADDR}:20000 \
--nnodes ${WORLD_SIZE} \
--node-rank ${RANK} \
--host 0.0.0.0 \
--port 29999
https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash#citationCitation
@misc{mimo2026v25pro_fp4dflash,
title={MiMo-V2.5-Pro-FP4-DFlash},
author={{Xiaomi MiMo Team}},
year={2026},
howpublished={\url{https://huggingface.co/collections/XiaomiMiMo/mimo-v25}},
}
https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash#contactContact
For questions or feedback, reach us at[email protected]or join our community:
Similar Articles
XiaomiMiMo/MiMo-V2.5-Pro
Xiaomi releases MiMo-V2.5-Pro, an open-source MoE language model with 1.02T total parameters and 1M token context, optimized for complex agentic and software engineering tasks.
@zephyr_z9: This is super big I think this is the first useful speculative decoding method deployed on a big quasi frontier model M…
Xiaomi MiMo releases MiMo-V2.5-Pro-UltraSpeed, achieving over 1,000 tokens per second on a 1 trillion parameter model using speculative decoding, the first practical deployment of such speed at scale.
MiMo-V2.5-coder
A quantized GGUF build of Xiaomi's MiMo-V2.5 model, tuned for coding and tool-calling on 128GB Apple Silicon systems, prioritizing reliable tool calls and code generation.
Xiaomi released their SOTA model, MiMo-V2.5-Pro.
Xiaomi launched MiMo-V2.5-Pro, claiming state-of-the-art performance.
China's Xiaomi MiMo Is Now 15X Faster Than ChatGPT and Claude (4 minute read)
Xiaomi achieved over 1,000 tokens per second inference on its trillion-parameter MiMo-V2.5-Pro-UltraSpeed model using commodity 8-GPU nodes via FP4 quantization and DFlash speculative decoding, outpacing GPT-5.5 and Claude Opus by over 10x.