inference-acceleration

#inference-acceleration

@zhijianliu_: DFlash is now running in a production inference stack. More draft models coming soon. https://github.com/z-lab/dflash

X AI KOLs Following ↗ · 6h ago Cached

DFlash is a lightweight block diffusion model for speculative decoding, now running in production with support for various LLMs like Qwen and Gemma.



z-lab/gemma-4-31B-it-DFlash

Hugging Face Models Trending ↗ · 2026-04-30 Cached

z-lab released DFlash, a speculative decoding drafter model for Gemma-4-31B-it that uses lightweight block diffusion to draft multiple tokens in parallel, achieving up to a 5.8× speedup over the autoregressive baseline.



$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction

arXiv cs.CL ↗ · 2026-04-22 Cached

R²-dLLM introduces spatio-temporal redundancy reduction techniques that cut diffusion LLM decoding steps by up to 75% while preserving generation quality, addressing a key deployment bottleneck.



Every AI researcher should grasp inference acceleration—CUDA Graph is the heart of vLLM's GPU efficiency

X AI KOLs Timeline ↗ · 2026-04-21 Cached

A tweet urging AI researchers to learn inference-acceleration basics and spotlighting CUDA Graph as the key to vLLM’s GPU utilization.



River-LLM: Large Language Model Seamless Exit Based on KV Share

Hugging Face Daily Papers ↗ · 2026-04-20 Cached

River-LLM proposes a training-free early-exit framework for decoder-only LLMs that uses KV-sharing to eliminate KV-cache gaps, achieving 1.71–2.16× speedup without quality loss.



Speculative Decoding for Autoregressive Video Generation

Hugging Face Daily Papers ↗ · 2026-04-19 Cached

SDVG adapts speculative decoding to autoregressive video diffusion, using an image-quality router to achieve up to a 2.09× speedup with 95.7% quality retention on MovieGenVideoBench.



z-lab/Qwen3.6-35B-A3B-DFlash

Hugging Face Models Trending ↗ · 2026-04-17 Cached

z-lab releases DFlash, a speculative decoding drafter that uses a lightweight block-diffusion model to draft 15–16 tokens in parallel, yielding up to 2.9× speedup for Qwen3.6-35B-A3B inference.



DFlash: Block Diffusion for Flash Speculative Decoding

Papers with Code Trending ↗ · 2026-02-05 Cached

DFlash is a new speculative decoding framework that uses a lightweight block diffusion model for parallel token drafting, achieving over 6× acceleration compared with autoregressive decoding. It significantly outperforms existing state-of-the-art methods such as EAGLE-3 while maintaining high output quality.

