@zephyr_z9: This is super big I think this is the first useful speculative decoding method deployed on a big quasi frontier model M…

X AI KOLs Following 06/08/26, 04:12 PM Models

speculative-decoding high-speed-inference trillion-parameter xiaomi tile-rt ultra-fast

Summary

Xiaomi MiMo has released MiMo-V2.5-Pro-UltraSpeed, which exceeds 1,000 tokens per second on a 1-trillion-parameter model using speculative decoding. Xiaomi describes this as the first time that output speed has been reached on a model of this size.

This is super big. I think this is the first useful speculative decoding method deployed on a big quasi-frontier model. Massive unlock. @fi56622380 https://t.co/augiaFLDOK

Original Article

Cached at: 06/09/26, 10:45 AM

Xiaomi MiMo (@XiaomiMiMo): 🚀 1,000+ TOKENS/S ON A 1T MODEL! 🚀

We are thrilled to release Xiaomi MiMo-V2.5-Pro-UltraSpeed in collaboration with @TileRT_AI, breaking the 1,000 tokens/s output speed on a 1 Trillion parameter model for the FIRST TIME!

Not wafer-scale integration like Cerebras. Not pure

Similar Articles

China's Xiaomi MiMo Is Now 15X Faster Than ChatGPT and Claude (4 minute read)

TLDR AI

Xiaomi achieved over 1,000 tokens per second inference on its trillion-parameter MiMo-V2.5-Pro-UltraSpeed model using commodity 8-GPU nodes via FP4 quantization and DFlash speculative decoding, outpacing GPT-5.5 and Claude Opus by over 10x.

XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash

Hugging Face Models Trending

XiaomiMiMo releases MiMo-V2.5-Pro-FP4-DFlash, an FP4-quantized MoE model with block-diffusion speculative decoding to reduce memory and bandwidth for trillion-parameter inference.

Xiaomi just claimed 1,000+ tps on a 1T model using a standard 8-GPU server

Reddit r/LocalLLaMA

Xiaomi released MiMo-V2.5-Pro-UltraSpeed in collaboration with TileRT, achieving over 1000 tokens/s decode speed on a 1-trillion-parameter model, enabling real-time AI interaction and accelerating coding agents and reasoning tasks.

2X tk/s (from 19.4 -> 38.1 tk/s on 1 x MI50) Playing with a hypothesis like speculative decoding.. but instead of an additional side model, exploiting that I can run multiple computations side-by-side AS IF I had Qwen3.6-27B loaded twice in memory - small quants don't use all the available compute.

Reddit r/LocalLLaMA

Packed Twin Inference (PTI) is a technique that achieves ~2× LLM throughput by running multiple token sequences in a single batch decode, exploiting weight sharing in llama.cpp without needing a draft model or additional VRAM.

XiaomiMiMo/MiMo-V2.5-Pro