@zephyr_z9: This is super big I think this is the first useful speculative decoding method deployed on a big quasi frontier model M…

X AI KOLs Following Models

Summary

Xiaomi MiMo has released MiMo-V2.5-Pro-UltraSpeed, which exceeds 1,000 output tokens per second on a 1-trillion-parameter model using speculative decoding. Xiaomi says this is the first time that output speed has been reached on a model of this size.

This is super big. I think this is the first useful speculative decoding method deployed on a big quasi-frontier model. Massive unlock. @fi56622380 https://t.co/augiaFLDOK

Cached at: 06/09/26, 10:45 AM

Xiaomi MiMo (@XiaomiMiMo): 🚀 1,000+ TOKENS/S ON A 1T MODEL! 🚀

We are thrilled to release Xiaomi MiMo-V2.5-Pro-UltraSpeed in collaboration with @TileRT_AI, breaking the 1,000 tokens/s output speed on a 1 Trillion parameter model for the FIRST TIME!

Not wafer-scale integration like Cerebras. Not pure

Similar Articles

XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash

Hugging Face Models Trending

XiaomiMiMo releases MiMo-V2.5-Pro-FP4-DFlash, an FP4-quantized MoE model with block-diffusion speculative decoding, intended to reduce memory and bandwidth requirements for trillion-parameter inference.

2X tk/s (from 19.4 -> 38.1 tk/s on 1 x MI50): playing with a hypothesis like speculative decoding, but instead of an additional side model, exploiting that I can run multiple computations side by side AS IF I had Qwen3.6-27B loaded twice in memory. Small quants don't use all the available compute.

Reddit r/LocalLLaMA

Packed Twin Inference (PTI) is a technique that achieves ~2× LLM throughput by running multiple token sequences in a single batch decode, exploiting weight sharing in llama.cpp without needing a draft model or additional VRAM.

XiaomiMiMo/MiMo-V2.5-Pro

Hugging Face Models Trending

Xiaomi releases MiMo-V2.5-Pro, an open-source MoE language model with 1.02T total parameters and 1M token context, optimized for complex agentic and software engineering tasks.