Tag
This paper proposes dMoE, a block-level mixture-of-experts framework for diffusion large language models that aggregates token-level expert distributions into block-level routing, reducing activated experts and memory usage while maintaining performance.
Domino is a speculative decoding framework that decouples causal dependency modeling from autoregressive drafting, using a parallel backbone and a lightweight causal refinement head to achieve up to 5.49x end-to-end speedup on Qwen3 models.
SEATS is a training-free, stage-adaptive token selection method that reduces computational overhead in omni-modal LLMs by progressively pruning redundant visual and audio tokens, achieving a 9.3x FLOPs reduction and a 4.8x prefill speedup while preserving 96.3% of performance.
Graft is a training-free framework that enhances speculative decoding by combining pruning and retrieval to improve acceptance rates and inference speed, achieving up to 5.41x speedup on short-context benchmarks and up to 21.8% improvement over EAGLE-3 on Qwen3-235B.
ZEDA is a low-cost framework that converts post-trained static MoE models into dynamic ones by injecting zero-output experts and using self-distillation, achieving over 50% expert FLOP reduction with marginal accuracy loss on benchmarks.
This paper introduces FeF-DLLM, a discrete diffusion language model that eliminates factorization errors by using exact prefix-conditioned factorization and accelerates inference via speculative decoding, achieving significant improvements in accuracy and speed on benchmarks such as GSM8K and MATH.
Orthrus is a dual-architecture framework that combines autoregressive LLMs with diffusion models for fast parallel token generation while maintaining exact inference fidelity via shared KV caches and consensus mechanisms, achieving up to 7.8x speedup.
DFlash introduces a block diffusion method for flash speculative decoding to enhance inference speed in large language models.
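
Several of the entries above build on speculative decoding. As a rough illustration of the shared draft-then-verify idea, here is a minimal sketch of the greedy, exact-match variant. It is not the method of any listed paper: the function names (`speculative_step`, `generate`) and the toy models are hypothetical, and a real system would score all drafted positions with one parallel pass of the target model instead of the loop used here.

```python
from typing import Callable, List, Sequence

# A "model" here is just a function mapping a token context to its greedy next token.
NextToken = Callable[[Sequence[int]], int]


def speculative_step(prefix: Sequence[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """Draft k tokens cheaply, then keep the longest prefix the target agrees with."""
    # Draft phase: the cheap model proposes k tokens autoregressively.
    drafted: List[int] = []
    for _ in range(k):
        drafted.append(draft_next(list(prefix) + drafted))

    # Verify phase: compare each drafted token with the target's greedy choice.
    # (Looped here for clarity; in practice this is a single batched target pass.)
    accepted: List[int] = []
    for tok in drafted:
        expected = target_next(list(prefix) + accepted)
        if tok != expected:
            # First mismatch: emit the target's token instead and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # Every drafted token matched, so the target contributes one extra token.
    accepted.append(target_next(list(prefix) + accepted))
    return accepted


def generate(prefix: Sequence[int], draft_next: NextToken,
             target_next: NextToken, k: int, max_new: int) -> List[int]:
    """Repeat speculative steps until max_new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[:len(prefix) + max_new]


# Toy models (assumptions for illustration): the target counts upward modulo 10,
# and the draft agrees with it except right after the token 4.
def toy_target(ctx: Sequence[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: Sequence[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_step([1], toy_draft, toy_target, k=4))
    print(generate([1], toy_draft, toy_target, k=4, max_new=8))
```

Because a drafted token is kept only when it equals the target's own greedy choice, the output matches plain greedy decoding with the target model. Any speedup comes from verifying several drafted tokens per target pass, so it grows with the acceptance rate.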