Tag
This paper introduces CATS, a cascaded adaptive tree speculation framework that accelerates LLM inference on memory-constrained edge devices by optimizing memory use while maintaining high token acceptance rates.
This paper introduces PARD-2, a dual-mode speculative decoding framework that uses target-aligned parallel draft models to accelerate LLM inference, achieving up to 6.94x lossless acceleration on Llama 3.1-8B.
This paper introduces DARE, a method for improving the inference efficiency of Diffusion Large Language Models by reusing cached key-value and output activations to reduce computational redundancy with negligible quality loss.
This paper introduces SpecBlock, a block-iterative speculative decoding method that combines path dependence with efficient drafting to accelerate LLM inference. It achieves higher speedups than existing methods such as EAGLE-3 while incurring lower drafting costs.
This paper introduces Normalizing Trajectory Models (NTM), a novel approach to diffusion-based generation that models reverse steps as conditional normalizing flows with exact likelihood training. NTM enables high-quality text-to-image generation in just four steps while retaining the likelihood framework, outperforming baselines on standard benchmarks.
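Several of these entries (CATS, PARD-2, SpecBlock) build on speculative decoding, where "token acceptance rate" refers to the standard draft-then-verify rule. Below is a minimal sketch of that generic verification step in the style of Leviathan et al. (2023); it is background for the tags above, not the specific method of any summarized paper, and the function name, tensor shapes, and variable names are illustrative assumptions.

```python
import torch

def verify_draft(p_target: torch.Tensor, q_draft: torch.Tensor,
                 draft_tokens: torch.Tensor) -> list[int]:
    """Accept each drafted token x_i with probability min(1, p(x_i)/q(x_i)).

    p_target, q_draft: (k, vocab) next-token distributions from the target
    and draft models at each of the k drafted positions.
    draft_tokens: (k,) token ids proposed by the draft model.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens.tolist()):
        p, q = p_target[i, tok], q_draft[i, tok]
        if torch.rand(()) <= p / q:  # accept with prob min(1, p/q)
            accepted.append(tok)
        else:
            # On rejection, resample from the residual (p - q)^+ so the
            # overall output distribution exactly matches the target model,
            # then stop consuming the draft (losslessness guarantee).
            residual = torch.clamp(p_target[i] - q_draft[i], min=0.0)
            residual /= residual.sum()
            accepted.append(int(torch.multinomial(residual, 1)))
            break
    return accepted
```

Because rejected positions are resampled from the normalized residual distribution, the verified output is distributed exactly as if the target model had decoded alone, which is why these methods can claim lossless acceleration.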