TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload

Hugging Face Daily Papers 05/19/26, 12:00 AM Papers

Summary

TIDE is a lossless inference system for diffusion large language models that leverages temporal stability of expert activations to reduce I/O overhead and computation, achieving up to 1.4-1.5x throughput improvements on single GPU-CPU systems.

Diffusion Large Language Models (dLLMs) have emerged as a competitive alternative to autoregressive (AR) models, offering better hardware utilization and bidirectional context through parallel block-level decoding. However, as dLLMs continue to scale up with mixture-of-experts (MoE) architectures, their deployment on resource-constrained devices remains an open challenge. Existing AR-based methods often incur either prohibitive I/O overhead or significant compute bottlenecks. In this work, we propose TIDE, a novel resource-efficient inference system that leverages the temporal stability of expert activations during the diffusion process within the block. Specifically, we introduce an interval-based expert refresh strategy that updates the expert placement in an I/O-aware fashion. To ensure optimal performance, we formulate the inference scheduling as a mathematical programming problem, solving for the optimal interval that minimizes I/O traffic and CPU computation. Most importantly, TIDE is a lossless optimization that requires no model training, providing a "free lunch" acceleration for dLLM inference. In a single GPU-CPU system, we demonstrate that TIDE achieves up to 1.4× and 1.5× throughput improvements over prior baselines on LLaDA2.0-mini and LLaDA2.0-flash models, respectively.
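The interval-based refresh idea can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the TIDE implementation: the function name simulate_interval_refresh, the single-layer setup, the fixed GPU capacity, and the placement rule (prefer experts that are already resident, then fill by expert id) are all hypothetical. It only counts how many experts are copied to the GPU and how many expert calls fall back to the CPU. Model outputs are unaffected in this picture, because an expert that is not resident is still computed, just on the CPU.

```python
def simulate_interval_refresh(activations, gpu_capacity, interval):
    """Count GPU loads and CPU fallbacks for one decoding block.

    activations  : list of sets; activations[t] holds the expert ids the
                   router selects at denoising step t (hypothetical input).
    gpu_capacity : number of experts that fit in GPU memory (assumption).
    interval     : placement is refreshed every `interval` steps.
    """
    gpu_resident = set()
    io_loads = 0          # experts copied from CPU to GPU
    cpu_expert_calls = 0  # activated experts that had to run on the CPU

    for step, active in enumerate(activations):
        if step % interval == 0:
            # Refresh: keep active experts that are already resident,
            # then fill the remaining slots with other active experts.
            keep = [e for e in sorted(active) if e in gpu_resident]
            new = [e for e in sorted(active) if e not in gpu_resident]
            wanted = set((keep + new)[:gpu_capacity])
            io_loads += len(wanted - gpu_resident)
            gpu_resident = wanted
        # Any activated expert that is not on the GPU runs on the CPU.
        cpu_expert_calls += len(active - gpu_resident)

    return io_loads, cpu_expert_calls


# Toy trace: experts 0 and 1 are stable across steps, the rest drift.
trace = [{0, 1, 2}, {0, 1, 3}, {0, 1, 2}, {0, 2, 4}]
print(simulate_interval_refresh(trace, gpu_capacity=2, interval=1))
print(simulate_interval_refresh(trace, gpu_capacity=2, interval=2))
```

On this toy trace, refreshing every step loads more experts but leaves fewer calls on the CPU, while refreshing every two steps does the opposite. That tension between I/O traffic and CPU computation is what the choice of interval has to balance.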

Original Article

Cached at: 05/22/26, 02:36 AM

Source: https://huggingface.co/papers/2605.20179

Abstract

Diffusion large language models face deployment challenges on resource-constrained devices, but a new inference system called TIDE addresses this by leveraging temporal stability of expert activations and optimizing expert placement to reduce I/O overhead and computation.

Diffusion Large Language Models (dLLMs) have emerged as a competitive alternative to autoregressive (AR) models, offering better hardware utilization and bidirectional context through parallel block-level decoding. However, as dLLMs continue to scale up with mixture-of-experts (MoE) architectures, their deployment on resource-constrained devices remains an open challenge. Existing AR-based methods often incur either prohibitive I/O overhead or significant compute bottlenecks. In this work, we propose TIDE, a novel resource-efficient inference system that leverages the temporal stability of expert activations during the diffusion process within the block. Specifically, we introduce an interval-based expert refresh strategy that updates the expert placement in an I/O-aware fashion. To ensure optimal performance, we formulate the inference scheduling as a mathematical programming problem, solving for the optimal interval that minimizes I/O traffic and CPU computation. Most importantly, TIDE is a lossless optimization that requires no model training, providing a “free lunch” acceleration for dLLM inference. In a single GPU-CPU system, we demonstrate that TIDE achieves up to 1.4× and 1.5× throughput improvements over prior baselines on LLaDA2.0-mini and LLaDA2.0-flash models, respectively.
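The abstract does not spell out the mathematical program, so the following is only a toy stand-in for the idea of solving for a refresh interval. It assumes a hypothetical cost model in which every refresh pays a fixed I/O cost and the CPU cost of a step grows linearly with the number of steps since the last refresh (as the GPU placement goes stale). The names total_cost, best_interval, refresh_io_cost, and cpu_drift_cost are illustrative and do not come from the paper, and the search is a plain enumeration rather than the authors' formulation.

```python
def total_cost(interval, num_steps, refresh_io_cost, cpu_drift_cost):
    """Cost of one block under the toy model (all parameters hypothetical)."""
    cost = 0.0
    for step in range(num_steps):
        offset = step % interval  # steps since the last refresh
        if offset == 0:
            cost += refresh_io_cost       # pay I/O to update placement
        cost += cpu_drift_cost * offset   # stale placement costs CPU time
    return cost


def best_interval(num_steps, refresh_io_cost, cpu_drift_cost):
    """Enumerate every candidate interval and return the cheapest one."""
    return min(
        range(1, num_steps + 1),
        key=lambda k: total_cost(k, num_steps, refresh_io_cost, cpu_drift_cost),
    )


if __name__ == "__main__":
    k = best_interval(num_steps=32, refresh_io_cost=8.0, cpu_drift_cost=1.0)
    print(k, total_cost(k, 32, 8.0, 1.0))
```

Under this toy model the average per-step cost is roughly refresh_io_cost / k + cpu_drift_cost * (k - 1) / 2, which is minimized near k = sqrt(2 * refresh_io_cost / cpu_drift_cost). A real system would presumably replace both cost terms with measured transfer bandwidth and CPU expert latency.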

Get this paper in your agent:

hf papers read 2605.20179

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

@snowboat84: https://x.com/snowboat84/status/2065215177029787705

Adaptive Tokenisation Via Temporal Redundancy Masking And Latent Inpainting [R]

"How NVIDIA Built Nemotron 3 Open Model" by "Caleb Writes Code" x "Joey Conway"

CRUMB: Efficient Prior Fitted Network Inference via Distributionally Matched Context Batching

DiffusionGemma: The Developer Guide- Google Developers Blog
