TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload
Summary
TIDE is a lossless, training-free inference system for mixture-of-experts diffusion large language models. It exploits the temporal stability of expert activations to reduce I/O overhead and CPU computation, achieving up to 1.4× and 1.5× throughput improvements over prior baselines on LLaDA2.0-mini and LLaDA2.0-flash, respectively, in a single GPU-CPU system.
Cached at: 05/22/26, 02:36 AM
Source: https://huggingface.co/papers/2605.20179
Abstract
Diffusion large language models face deployment challenges on resource-constrained devices, but a new inference system called TIDE addresses this by leveraging temporal stability of expert activations and optimizing expert placement to reduce I/O overhead and computation.
Diffusion Large Language Models (dLLMs) have emerged as a competitive alternative to autoregressive (AR) models, offering better hardware utilization and bidirectional context through parallel block-level decoding. However, as dLLMs continue to scale up with mixture-of-experts (MoE) architectures, their deployment on resource-constrained devices remains an open challenge. Existing AR-based methods often incur either prohibitive I/O overhead or significant compute bottlenecks. In this work, we propose TIDE, a novel resource-efficient inference system that leverages the temporal stability of expert activations during the diffusion process within a block. Specifically, TIDE introduces an interval-based expert refresh strategy that updates the expert placement in an I/O-aware fashion. To ensure optimal performance, we formulate the inference scheduling as a mathematical programming problem, solving for the optimal interval that minimizes I/O traffic and CPU computation. Most importantly, TIDE is a lossless optimization that requires no model training, providing a “free lunch” acceleration for dLLM inference. In a single GPU-CPU system, we demonstrate that TIDE achieves up to 1.4× and 1.5× throughput improvements over prior baselines on LLaDA2.0-mini and LLaDA2.0-flash models, respectively.
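The interval-based refresh idea can be illustrated with a toy simulation. The sketch below is not the paper's formulation: the cost model, the function names (`simulate_interval`, `best_interval`), the placement rule, and the parameters `gpu_slots`, `io_cost`, and `cpu_cost` are all assumptions made for illustration, and a brute-force search stands in for the mathematical program the abstract mentions. It assumes that experts not resident on the GPU are evaluated on the CPU, so every routed expert is still computed and the output is unchanged.

```python
from typing import List, Optional, Set, Tuple


def simulate_interval(
    activations: List[Set[int]], interval: int, gpu_slots: int
) -> Tuple[int, int]:
    """Return (experts_loaded, cpu_expert_calls) for one block.

    activations[t] is the set of expert ids routed to at diffusion step t.
    Hypothetical toy model, not the TIDE scheduler.
    """
    resident: Set[int] = set()
    loads = 0
    cpu_calls = 0
    for step, active in enumerate(activations):
        if step % interval == 0:
            # Refresh: place this step's active experts on the GPU,
            # truncated to the assumed GPU capacity.
            wanted = set(sorted(active)[:gpu_slots])
            # Only experts that are not already resident cross the bus.
            loads += len(wanted - resident)
            resident = wanted
        # Routed experts that are not resident run on the CPU instead,
        # so no expert is skipped (the lossless assumption).
        cpu_calls += len(active - resident)
    return loads, cpu_calls


def best_interval(
    activations: List[Set[int]],
    gpu_slots: int,
    io_cost: float,
    cpu_cost: float,
) -> Optional[Tuple[int, float]]:
    """Brute-force the refresh interval with the lowest weighted cost."""
    best: Optional[Tuple[int, float]] = None
    for interval in range(1, len(activations) + 1):
        loads, cpu_calls = simulate_interval(activations, interval, gpu_slots)
        cost = loads * io_cost + cpu_calls * cpu_cost
        if best is None or cost < best[1]:
            best = (interval, cost)
    return best


# Four steps with mostly stable routing: only expert 2 is swapped for 3.
acts = [{0, 1, 2}, {0, 1, 2}, {0, 1, 3}, {0, 1, 3}]
print(best_interval(acts, gpu_slots=3, io_cost=4.0, cpu_cost=1.0))
# → (4, 14.0)
```

In this toy trace, loading an expert is assumed to cost four times as much as one CPU expert call, so refreshing only once per block wins: running expert 3 on the CPU twice is cheaper than transferring it. When routing is less stable or CPU compute is relatively more expensive, a shorter interval would be selected, which is the trade-off the interval search is meant to capture.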
Similar Articles
@snowboat84: https://x.com/snowboat84/status/2065215177029787705
This article is the middle part of the AI Engineering Landscape series, detailing core techniques such as inference optimization, model slimming (quantization, distillation, pruning, MoE), and speculative decoding, while reviewing the latest advances from hardware to the engineering stack.
Adaptive Tokenisation Via Temporal Redundancy Masking And Latent Inpainting [R]
This paper introduces an adaptive video tokenisation method that exploits temporal redundancy in latent space to allocate tokens dynamically, achieving efficient compression without auxiliary networks. The proposed Latent Inpainting Transformer reconstructs dropped positions, delivering 31x speedup over ElasticTok-CV and 2x over InfoTok.
"How NVIDIA Built Nemotron 3 Open Model" by "Caleb Writes Code" x "Joey Conway"
NVIDIA released the Nemotron 3 open model, offering three sizes: Nano, Super, and Ultra. It optimizes hardware efficiency through architectural innovations such as hybrid Mamba Transformer, latent MoE, and multi-token prediction, and adopts the Open MDW 1.1 open license.
CRUMB: Efficient Prior Fitted Network Inference via Distributionally Matched Context Batching
This paper proposes CRUMB, a three-stage inference wrapper that clusters test queries and selects a distributionally matched training subset via MMD minimization to enable efficient Prior-Fitted Network inference on large datasets, achieving state-of-the-art context selection on 51 TabArena datasets.
DiffusionGemma: The Developer Guide - Google Developers Blog
DiffusionGemma is a new experimental model from Google DeepMind that uses parallel generation on a 256-token canvas, achieving up to 4x faster token generation on GPUs. This developer guide explains its architecture, bidirectional context, and includes a fine-tuning recipe for solving Sudoku.