Tag
This paper introduces CATS, a cascaded adaptive tree speculation framework that accelerates LLM inference on memory-constrained edge devices by optimizing memory use while maintaining high token acceptance rates.
This paper introduces PARD-2, a dual-mode speculative decoding framework that uses target-aligned parallel draft models to accelerate LLM inference, achieving up to 6.94x lossless acceleration on Llama 3.1-8B.
This paper introduces DARE, a method for improving the inference efficiency of Diffusion Large Language Models by reusing cached key-value and output activations to reduce computational redundancy with negligible quality loss.
This paper introduces SpecBlock, a block-iterative speculative decoding method that combines path dependence with efficient drafting to accelerate LLM inference. It achieves higher speedups than existing methods such as EAGLE-3 while incurring lower drafting costs.
This paper introduces Normalizing Trajectory Models (NTM), a novel approach to diffusion-based generation that models reverse steps as conditional normalizing flows with exact likelihood training. NTM enables high-quality text-to-image generation in just four steps while retaining the likelihood framework, outperforming baselines on standard benchmarks.
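Several of these entries (CATS, PARD-2, SpecBlock) build on speculative decoding, where "token acceptance rate" refers to the standard draft-then-verify rule. Below is a minimal sketch of that generic verification step in the style of Leviathan et al. (2023); it is background for the tags above, not the specific method of any summarized paper, and the function name, tensor shapes, and variable names are illustrative assumptions.

```python
import torch

def verify_draft(p_target: torch.Tensor, q_draft: torch.Tensor,
                 draft_tokens: torch.Tensor) -> list[int]:
    """Accept each drafted token x_i with probability min(1, p(x_i)/q(x_i)).

    p_target, q_draft: (k, vocab) next-token distributions from the target
    and draft models at each of the k drafted positions.
    draft_tokens: (k,) token ids proposed by the draft model.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens.tolist()):
        p, q = p_target[i, tok], q_draft[i, tok]
        if torch.rand(()) <= p / q:  # accept with prob min(1, p/q)
            accepted.append(tok)
        else:
            # On rejection, resample from the residual (p - q)^+ so the
            # overall output distribution exactly matches the target model,
            # then stop consuming the draft (losslessness guarantee).
            residual = torch.clamp(p_target[i] - q_draft[i], min=0.0)
            residual /= residual.sum()
            accepted.append(int(torch.multinomial(residual, 1)))
            break
    return accepted
```

Because rejected positions are resampled from the normalized residual distribution, the verified output is distributed exactly as if the target model had decoded alone, which is why these methods can claim lossless acceleration.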