Tag
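A detailed walkthrough of W4A4 CUDA kernel optimizations for a voice-cloning model, using INT4 quantization and fused LoRA to achieve inference 2.6x faster than FP16.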
On Friday, we released six new state-of-the-art drafters for accelerated inference, along with a blog post on speculative decoding and a roofline model tool to estimate speedups.
Charles Frye announces the co-release with Z Lab of six new DFlash speculators for Alibaba Qwen 3.x models, achieving over 1k output tokens per second for Qwen 3.5 122B-A10B on a B200.
Modal and Z Lab release six new DFlash speculative decoding draft models for Qwen 3.x, achieving over 1000 tokens per second on a B200 and arguing that speculative decoding is the most impactful inference optimization.
GLM-5.2 adopts MTP (Multi-Token Prediction) technology to accelerate inference and fixes a training-inference discrepancy in GLM-5.1's MTP that caused KV cache mixing issues.
Microsoft Research introduces Next-Latent Prediction (NextLat), a self-supervised method that trains transformers to predict their own next latent state, enabling compact world models for reasoning and planning and achieving up to 3.3x faster inference via self-speculative decoding.
Tensordyne announced an inference system that implements logarithmic math in hardware, claiming 17x more tokens per watt and 13x higher throughput than NVIDIA Blackwell by replacing multiplications with additions in log space.
DFlash, a block-diffusion drafter with KV injection, now runs at frontier scale, achieving up to 4.3x higher throughput than the baseline; it is integrated with Modal and SGLang for Qwen 397B.
This paper proposes Prefilling-dLLM, a training-free framework that partitions the prefix into chunks and caches KV representations, achieving state-of-the-art quality and up to 28x speedup for long-context inference in diffusion language models.
Bebop proposes entropy-aware multi-token prediction with rejection sampling and a novel TV loss to accelerate RL training of LLMs, achieving up to 1.8x speedup. The method addresses the degradation of acceptance rates during RL by optimizing training objectives.
Packed Twin Inference (PTI) is a technique that achieves ~2× LLM throughput by running multiple token sequences in a single batch decode, exploiting weight sharing in llama.cpp without needing a draft model or additional VRAM.
Next Forcing introduces a multi-chunk prediction framework for causal world modeling that accelerates training and inference for autoregressive video generation while improving accuracy and physical law adherence.
TBD-VLA introduces a discrete vision-language-action framework that combines block diffusion with autoregressive generation to achieve efficient temporal action modeling and faster inference, significantly outperforming prior VLA approaches in simulation and real-world manipulation tasks.
RhymeFlow accelerates diffusion transformers for video generation by decoupling denoising trajectories across frames, using keyframe anchoring and latent trajectory projection to reduce computational overhead while maintaining visual quality.
This paper introduces Temporal-Spatial Parallel Decoding (TSPD) and Confidence Extrapolation (CE) to accelerate inference in diffusion-based large language models by dynamically deciding when tokens have converged and forecasting logit trends, reducing unnecessary denoising steps while preserving output quality.
Light Interaction introduces a training-free inference acceleration framework for interactive video world models, using adaptive context management, denoising cache acceleration, and 3D block sparse attention to achieve up to 2.59x speedup while maintaining competitive visual quality.
MicroSpec is a training-free technique that builds compact, context-sensitive vocabularies on-the-fly to accelerate speculative decoding in large language models, reducing average vocabulary size by over 40x and achieving up to 1.32x end-to-end speedup over EAGLE-2.
A new method called Zero-Expert Self-Distillation Adaptation (ZEDA) allows MoE models like Qwen3 and GLM to skip half their expert computations on easy tokens with minimal accuracy loss, achieving ~20% inference speedup by adding dummy experts that output nothing.
PulseCol introduces a periodically refreshed column-sparse attention method for diffusion language models, achieving higher sparsity and up to 1.95x end-to-end speedup over FlashAttention while maintaining model quality.
NVIDIA Model Optimizer is a library that compresses deep learning models using techniques like quantization, distillation, pruning, and speculative decoding to accelerate inference. It supports Hugging Face, PyTorch, and ONNX models and integrates with NVIDIA inference frameworks.
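
Several of the items above rely on speculative decoding with a draft model. As a rough illustration of the general idea only (not the method of DFlash, MicroSpec, or any other system listed here), the sketch below shows greedy draft-and-verify decoding. `target_next` and `draft_next` are hypothetical toy stand-ins for a large model and a small drafter, and the verification loop stands in for what would be a single batched forward pass of the target in a real system.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def target_next(ctx: List[Token]) -> Token:
    """Hypothetical 'large' model: a deterministic toy rule over the context."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[Token]) -> Token:
    """Hypothetical 'small' drafter: agrees with the target except at some positions."""
    tok = target_next(ctx)
    return tok if len(ctx) % 4 != 3 else (tok + 1) % 50


def greedy_decode(model: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Plain autoregressive greedy decoding: one model call per new token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[Token], n_new: int, k: int = 4
) -> Tuple[List[Token], int]:
    """Greedy draft-and-verify decoding; returns (tokens, verification passes)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(out) < goal:
        # 1. The drafter cheaply proposes k tokens, conditioning on its own guesses.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))

        # 2. The target checks the proposal (one batched pass in a real system).
        passes += 1
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # first mismatch: keep the target's token
                break
            accepted.append(tok)
        else:
            # Every drafted token matched, so the target contributes one extra token.
            accepted.append(target(out + accepted))

        out.extend(accepted)
    return out[:goal], passes


prompt = [1, 2, 3]
baseline = greedy_decode(target_next, prompt, 20)
spec_tokens, spec_passes = speculative_decode(target_next, draft_next, prompt, 20)
print(spec_tokens == baseline, spec_passes)
```

In this greedy variant the output is identical to decoding with the target alone; the potential saving comes from needing fewer target passes than generated tokens, and it depends on how often the drafter agrees with the target.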