inference-speedup

#inference-speedup

dMoE: dLLMs with Learnable Block Experts

Hugging Face Daily Papers · 2026-05-29

This paper proposes dMoE, a block-level mixture-of-experts framework for diffusion large language models that aggregates token-level expert distributions into block-level routing, reducing activated experts and memory usage while maintaining performance.

Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding

Hugging Face Daily Papers · 2026-05-28

Domino is a speculative decoding framework that decouples causal dependency modeling from autoregressive drafting, using a parallel backbone and a lightweight causal refinement head to achieve up to 5.49× end-to-end speedup on Qwen3 models.

Stage-adaptive Token Selection for Efficient Omni-modal LLMs

Hugging Face Daily Papers · 2026-05-19

SEATS is a training-free, stage-adaptive token selection method that reduces computational overhead in omni-modal LLMs by progressively pruning redundant visual and audio tokens, achieving a 9.3× FLOPs reduction and a 4.8× prefill speedup while preserving 96.3% of performance.

Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding

Hugging Face Daily Papers · 2026-05-19

Graft is a training-free framework that enhances speculative decoding by combining pruning and retrieval to improve acceptance rates and inference speed, achieving up to 5.41× speedup on short-context benchmarks and up to 21.8% improvement over EAGLE-3 on Qwen3-235B.

Post-Trained MoE Can Skip Half Experts via Self-Distillation

Hugging Face Daily Papers · 2026-05-18

ZEDA is a low-cost framework that converts post-trained static MoE models into dynamic ones by injecting zero-output experts and using self-distillation, achieving over 50% expert FLOP reduction with marginal accuracy loss on benchmarks.

Factorization-Error-Free Discrete Diffusion Language Model via Speculative Decoding

arXiv cs.CL · 2026-05-15

This paper introduces FeF-DLLM, a discrete diffusion language model that eliminates factorization errors by using exact prefix-conditioned factorization and accelerates inference via speculative decoding, achieving significant improvements in accuracy and speed on benchmarks such as GSM8K and MATH.

Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

Hugging Face Daily Papers · 2026-05-12

Orthrus is a dual-architecture framework that combines autoregressive LLMs with diffusion models for fast parallel token generation while maintaining exact inference fidelity via shared KV caches and consensus mechanisms, achieving up to 7.8× speedup.

z-lab/dflash

GitHub Trending (daily) · 2026-05-08

DFlash introduces a block diffusion method for flash speculative decoding to enhance inference speed in large language models.
