A GGUF conversion of MiniMax M3's EAGLE draft model for llama.cpp is now available, enabling speculative decoding speedups on compatible hardware.
Speculative decoding is an inference optimization technique in which a fast draft model proposes several future tokens and a larger target model verifies them in parallel, improving LLM generation speed. The article highlights its trending status on Papers with Code and a recent SGLang blog post about state-of-the-art latencies using DFlash models.
Discusses the different flavors of speculative decoding and describes an attempt to produce a Qwen-3.6-27b EAGLE-3 drafter for the community.
Proposes PPOW, a reinforcement learning framework for optimizing draft models in speculative decoding using window-level objectives and adaptive windowing, achieving significant speedups across multiple benchmarks.
SlimSpec introduces a low-rank parameterization for drafter LM-heads to accelerate speculative decoding in LLMs, achieving 4-5x speedup while maintaining full vocabulary support.
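The draft-then-verify loop that all of these entries build on can be illustrated with a minimal sketch. It assumes greedy decoding and uses toy stand-in models (`toy_target`, `toy_draft`) that map a token prefix to a single next token; these names, the window size `k`, and the helper functions are hypothetical and do not come from any of the listed projects. A real system would verify all drafted positions in one batched forward pass of the target model rather than in a Python loop.

```python
from typing import Callable, List

# Hypothetical interface: a "model" maps a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def speculative_step(target: Model, draft: Model, prefix: List[int], k: int) -> List[int]:
    """Run one draft-and-verify round and return the tokens to append."""
    # 1. The draft model proposes k tokens autoregressively (cheap).
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target model checks each proposed position. In practice this is
    #    a single parallel forward pass; here it is a loop for clarity.
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if expected != tok:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft token matched, so the target contributes one extra token.
    accepted.append(target(ctx))
    return accepted


def generate(target: Model, draft: Model, prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Generate n_new tokens after a non-empty prompt using speculative rounds."""
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(target, draft, out, k))
    return out[: len(prompt) + n_new]


# Toy models (assumptions for illustration only).
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    # Agrees with the target except after token 4, where it guesses wrong.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


def greedy(target: Model, prompt: List[int], n_new: int) -> List[int]:
    """Plain one-token-at-a-time decoding, used as a reference."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out
```

Under greedy verification the output is identical to what the target model would produce on its own; the speedup comes only from accepting several draft tokens per target pass. Better drafters, which is what the EAGLE, PPOW, and SlimSpec entries above are concerned with, raise the number of tokens accepted per round or lower the cost of producing them.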