inference-acceleration

#inference-acceleration

@charles_irl: https://x.com/charles_irl/status/2069113412869914944

X AI KOLs Timeline ↗ · 2d ago Cached

Details W4A4 CUDA kernel optimizations for a voice-cloning model, achieving 2.6x faster inference than FP16 through INT4 quantization and fused LoRA.
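As a rough illustration of what "W4A4" means (4-bit weights and 4-bit activations), here is a minimal pure-Python sketch of symmetric per-tensor INT4 quantization followed by an integer dot product. The helper names `quantize_int4` and `w4a4_dot` are hypothetical, and this is not the kernel from the linked post: a real CUDA kernel would pack two INT4 values per byte and fuse the LoRA path, which this sketch does not attempt.

```python
from typing import List, Tuple


def quantize_int4(xs: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to the signed 4-bit range [-8, 7].

    Hypothetical helper for illustration only.
    """
    max_abs = max(abs(x) for x in xs)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(x / scale))) for x in xs]
    return q, scale


def w4a4_dot(weights: List[float], activations: List[float]) -> float:
    """Dot product with both operands quantized to INT4.

    The multiply-accumulate runs on small integers; the two scales are
    applied once at the end to return to floating point.
    """
    qw, sw = quantize_int4(weights)
    qa, sa = quantize_int4(activations)
    acc = sum(w * a for w, a in zip(qw, qa))  # integer accumulation
    return acc * sw * sa


w = [0.7, -0.3, 0.1]
a = [1.4, 0.6, -1.4]
approx = w4a4_dot(w, a)
exact = sum(x * y for x, y in zip(w, a))
```

The values above happen to quantize without rounding loss, so `approx` and `exact` agree closely; arbitrary inputs will show quantization error.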


@charles_irl: On Friday, we released six new state-of-the-art drafters for accelerated inference. We also put out a blog post on why …

X AI KOLs Following ↗ · 3d ago Cached

On Friday, we released six new state-of-the-art drafters for accelerated inference, along with a blog post on speculative decoding and a roofline model tool to estimate speedups.


@LottoLabs: This is awesome work Dflash for qwen 3.5/6 series

X AI KOLs Timeline ↗ · 5d ago Cached

Charles Frye announces the co-release with Z Lab of six new DFlash speculators for Alibaba Qwen 3.x models, achieving over 1k output tokens per second for Qwen 3.5 122B-A10B on a B200.


@charles_irl: Speculation Is All You Need. In this blog post, we announce the co-release (w/ Z Lab) of six more state-of-the-art DFla…

X AI KOLs Following ↗ · 5d ago Cached

Modal and Z Lab release six new DFlash speculative decoding draft models for Qwen 3.x, achieving over 1000 tokens per second on a B200 and arguing that speculative decoding is the most impactful inference optimization.


@jakevin7: Recently I've been reading about GLM 5.2 and found some interesting things to share. GLM-5.2 uses MTP (Multi-Token Prediction) to accelerate inference: a lightweight "draft model" quickly predicts multiple tokens, then the main model verifies them all at once; if accepted, it skips the decoding steps.

X AI KOLs Following ↗ · 5d ago Cached

GLM-5.2 adopts MTP (Multi-Token Prediction) technology to accelerate inference and fixes a training-inference discrepancy in GLM-5.1's MTP that caused KV cache mixing issues.
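The draft-then-verify loop described in this item can be sketched in a few lines of Python. This is a generic greedy speculative-decoding sketch under simplifying assumptions, not GLM-5.2's MTP implementation: the "models" are toy next-token functions (`toy_target`, `toy_draft`, both hypothetical), and verification is written as a loop where a real system would use a single batched forward pass of the main model.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy next-token function


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[int], k: int) -> List[int]:
    """Return the tokens produced by one draft-then-verify round."""
    # 1. The cheap draft proposes k tokens autoregressively.
    ctx = list(prefix)
    proposed = []
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target checks each proposal (one batched pass in practice).
    ctx = list(prefix)
    accepted = []
    for tok in proposed:
        expected = target(ctx)
        if expected != tok:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. All proposals accepted: the target contributes one extra token.
    accepted.append(target(ctx))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prefix: List[int], n_new: int, k: int = 4) -> List[int]:
    out = list(prefix)
    while len(out) - len(prefix) < n_new:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prefix) + n_new]


def greedy_decode(target: NextToken, prefix: List[int], n_new: int) -> List[int]:
    out = list(prefix)
    for _ in range(n_new):
        out.append(target(out))
    return out


def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    # Agrees with the target except after token 2, where it guesses wrong.
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10


spec_out = generate(toy_target, toy_draft, [0], 12)
```

With greedy verification the output is identical to decoding with the target alone; the gain comes from accepting several tokens per target pass when the draft guesses correctly.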


Next-Latent Prediction Transformers [R]

Reddit r/MachineLearning ↗ · 2026-06-17

Microsoft Research introduces Next-Latent Prediction (NextLat), a self-supervised method that trains transformers to predict their own next latent state, enabling compact world models for reasoning and planning and achieving up to 3.3x faster inference via self-speculative decoding.


Tensordyne announces Logarithmic AI compute chips. 17x more tokens per watt and 13x higher throughput than NVIDIA Blackwell.

Reddit r/singularity ↗ · 2026-06-15

Tensordyne announced a breakthrough inference system using logarithmic math in hardware, claiming 17x more tokens per watt and 13x higher throughput than NVIDIA Blackwell, achieved by replacing complex multiplication with simple addition in log space.
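The idea of replacing multiplication with addition in log space rests on the identity log2(a·b) = log2(a) + log2(b). The sketch below is a generic sign-plus-logarithm encoding for illustration only; the function names are hypothetical and nothing here reflects Tensordyne's actual number format or hardware design.

```python
import math
from typing import Tuple

LogNum = Tuple[int, float]  # (sign, log2 of the magnitude)


def to_log(x: float) -> LogNum:
    """Encode a nonzero real as (sign, log2|x|)."""
    if x == 0:
        raise ValueError("zero needs a special encoding in a log number system")
    return (1 if x > 0 else -1, math.log2(abs(x)))


def log_mul(a: LogNum, b: LogNum) -> LogNum:
    """Multiply two encoded values: signs multiply, exponents add."""
    return (a[0] * b[0], a[1] + b[1])


def from_log(v: LogNum) -> float:
    """Decode back to an ordinary float."""
    return v[0] * 2.0 ** v[1]


product = from_log(log_mul(to_log(3.0), to_log(-4.0)))
```

Here `product` is approximately -12.0, obtained with a single floating-point addition in place of the multiply. Addition of two encoded values is the hard operation in such a system and is not shown.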


@zhijianliu_: This is what DFlash was built for. Our block-diffusion drafter + KV injection, now running at frontier scale — thanks t…

X AI KOLs Following ↗ · 2026-06-15 Cached

DFlash, a block-diffusion drafter with KV injection, is now running at frontier scale, achieving up to 4.3x higher throughput than the baseline, and is integrated with Modal and SGLang for Qwen 397B.


Prefilling-dLLM: Predictive Prefilling for Long-Context Inference in Diffusion Language Models

arXiv cs.CL ↗ · 2026-06-10 Cached

This paper proposes Prefilling-dLLM, a training-free framework that partitions the prefix into chunks and caches KV representations, achieving state-of-the-art quality and up to 28x speedup for long-context inference in diffusion language models.


Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling

Hugging Face Daily Papers ↗ · 2026-06-10 Cached

Bebop proposes entropy-aware multi-token prediction with rejection sampling and a novel TV loss to accelerate RL training of LLMs, achieving up to 1.8x speedup. The method addresses the degradation of acceptance rates during RL by optimizing training objectives.


2X tk/s (from 19.4 -> 38.1 tk/s on 1 x MI50) Playing with a hypothesis like speculative decoding.. but instead of an additional side model, exploiting that I can run multiple computations side-by-side AS IF I had Qwen3.6-27B loaded twice in memory - small quants don't use all the available compute.

Reddit r/LocalLLaMA ↗ · 2026-06-09 Cached

Packed Twin Inference (PTI) is a technique that achieves ~2× LLM throughput by running multiple token sequences in a single batch decode, exploiting weight sharing in llama.cpp without needing a draft model or additional VRAM.


Next Forcing: Causal World Modeling with Multi-Chunk Prediction

Hugging Face Daily Papers ↗ · 2026-06-09 Cached

Next Forcing introduces a multi-chunk prediction framework for causal world modeling that accelerates training and inference for autoregressive video generation while improving accuracy and physical law adherence.


TBD-VLA: Temporal Block Diffusion Vision Language Action Model

Hugging Face Daily Papers ↗ · 2026-06-05 Cached

TBD-VLA introduces a discrete vision-language-action framework that combines block diffusion with autoregressive generation to achieve efficient temporal action modeling and faster inference, significantly outperforming prior VLA approaches in simulation and real-world manipulation tasks.


RhymeFlow: Training-Free Acceleration for Video Generation with Asynchronous Denoising Flow Scheduling

Hugging Face Daily Papers ↗ · 2026-06-04 Cached

RhymeFlow accelerates diffusion transformers for video generation by decoupling denoising trajectories across frames, using keyframe anchoring and latent trajectory projection to reduce computational overhead while maintaining visual quality.


Efficient Diffusion LLMs via Temporal-Spatial Parallel Decoding and Confidence Extrapolation

arXiv cs.CL ↗ · 2026-06-01 Cached

This paper introduces Temporal-Spatial Parallel Decoding (TSPD) and Confidence Extrapolation (CE) to accelerate inference in diffusion-based large language models by dynamically deciding when tokens have converged and forecasting logit trends, reducing unnecessary denoising steps while preserving output quality.


Light Interaction: Training-Free Inference Acceleration for Interactive Video World Models

Hugging Face Daily Papers ↗ · 2026-05-29 Cached

Light Interaction introduces a training-free inference acceleration framework for interactive video world models, using adaptive context management, denoising cache acceleration, and 3D block sparse attention to achieve up to 2.59x speedup while maintaining competitive visual quality.


MicroSpec: Accelerating Speculative Decoding with Lightweight In-Context Vocabularies

arXiv cs.CL ↗ · 2026-05-27 Cached

MicroSpec is a training-free technique that builds compact, context-sensitive vocabularies on-the-fly to accelerate speculative decoding in large language models, reducing average vocabulary size by over 40x and achieving up to 1.32x end-to-end speedup over EAGLE-2.


@rohanpaul_ai: A large MoE model may be wasting half its expert compute on tokens that barely need expert help. In this paper 50% of e…

X AI KOLs Timeline ↗ · 2026-05-24 Cached

A new method called Zero-Expert Self-Distillation Adaptation (ZEDA) allows MoE models like Qwen3 and GLM to skip half their expert computations on easy tokens with minimal accuracy loss, achieving ~20% inference speedup by adding dummy experts that output nothing.


PulseCol: Periodically Refreshed Column-Sparse Attention for Accelerating Diffusion Language Models

arXiv cs.CL ↗ · 2026-05-21 Cached

PulseCol introduces a periodically refreshed column-sparse attention method for diffusion language models, achieving higher sparsity and up to 1.95x end-to-end speedup over FlashAttention while maintaining model quality.


@tom_doerr: Compresses deep learning models for faster inference https://github.com/NVIDIA/Model-Optimizer…

X AI KOLs Timeline ↗ · 2026-05-19 Cached

NVIDIA Model Optimizer is a library that compresses deep learning models using techniques like quantization, distillation, pruning, and speculative decoding to accelerate inference. It supports Hugging Face, PyTorch, and ONNX models and integrates with NVIDIA inference frameworks.
