parallel-decoding

#parallel-decoding

Dynamic-dLLM: Dynamic Cache-Budget and Adaptive Parallel Decoding for Training-Free Acceleration of Diffusion LLM

arXiv cs.CL · 2d ago

This paper proposes Dynamic-dLLM, a training-free framework that accelerates diffusion large language models by dynamically allocating cache-update budgets and calibrating decoding thresholds, achieving over 3x speedup on models like LLaDA and Dream while maintaining performance.
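To make the idea of a decoding threshold concrete, here is a minimal sketch of generic confidence-threshold parallel decoding for a masked diffusion language model. It is not the Dynamic-dLLM algorithm: the fixed `threshold`, the `toy_denoiser`, and every name below are assumptions made for illustration, and the sketch leaves out the cache-update budgeting and the threshold calibration that the paper describes.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # placeholder for a position that has not been decoded yet
Sequence = List[Optional[int]]
# Hypothetical denoiser interface: for every position, return (token, confidence).
Denoiser = Callable[[Sequence], List[Tuple[int, float]]]


def parallel_decode(denoise: Denoiser, length: int, threshold: float = 0.9) -> Tuple[List[int], int]:
    """Unmask a sequence, committing every confident position in each step."""
    seq: Sequence = [MASK] * length
    steps = 0
    while any(tok is MASK for tok in seq):
        preds = denoise(seq)
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        # Commit all masked positions whose confidence clears the threshold.
        commit = [i for i in masked if preds[i][1] >= threshold]
        if not commit:
            # Guarantee progress: fall back to the single most confident position.
            commit = [max(masked, key=lambda i: preds[i][1])]
        for i in commit:
            seq[i] = preds[i][0]
        steps += 1
    return seq, steps


def toy_denoiser(seq: Sequence) -> List[Tuple[int, float]]:
    """Toy stand-in for a model: even positions are easy, odd ones need left context."""
    preds = []
    for i in range(len(seq)):
        left_known = i > 0 and seq[i - 1] is not MASK
        conf = 0.95 if (i % 2 == 0 or left_known) else 0.5
        preds.append((i * 10, conf))
    return preds


tokens, steps = parallel_decode(toy_denoiser, length=6, threshold=0.9)
print(tokens, steps)
```

With this toy denoiser, a threshold of 0.9 finishes six tokens in two steps, while a threshold of 0.99 falls back to one token per step. That trade-off between the number of steps and how aggressively tokens are committed is what threshold calibration aims to manage.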

What is Speculative Decoding? (trending on paperswithco.de) [R]

Reddit r/MachineLearning · 2026-06-17

Speculative decoding is an inference optimization in which a fast draft model proposes several future tokens and a larger model verifies them in parallel, speeding up LLM generation. The post notes that the technique is trending on Papers with Code and points to a recent SGLang blog post reporting state-of-the-art latencies with DFlash models.
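The draft-then-verify loop can be sketched in a few lines. This is a simplified greedy variant under stated assumptions: both "models" are hypothetical functions that map a token prefix to a single next token, and verification runs position by position, whereas a real system would score all drafted positions in one batched forward pass of the larger model. None of the names below come from SGLang or any other library.

```python
from typing import Callable, List

# Hypothetical model interface: token prefix -> greedy next token.
Model = Callable[[List[int]], int]


def speculative_step(target: Model, draft: Model, prefix: List[int], k: int) -> List[int]:
    """Run one draft-then-verify round and return the accepted tokens."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target model checks each proposed token against its own choice.
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if expected != tok:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft token matched, so the target contributes one extra token.
    accepted.append(target(ctx))
    return accepted


def generate(target: Model, draft: Model, prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Generate n_new tokens; the result matches plain greedy decoding with target."""
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(target, draft, out, k))
    return out[: len(prompt) + n_new]


def toy_target(prefix: List[int]) -> int:
    """Toy 'large' model: counts upward modulo 10."""
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: List[int]) -> int:
    """Toy 'small' model: agrees with the target except right after token 2."""
    return 9 if prefix[-1] == 2 else (prefix[-1] + 1) % 10


print(generate(toy_target, toy_draft, prompt=[0], n_new=8))
```

Because a rejected draft token is always replaced by the target's own choice, this greedy variant produces the same sequence as decoding with the target alone. The speedup comes from accepting several tokens per target pass whenever the draft model guesses correctly.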

PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models

Hugging Face Daily Papers · 2026-06-17

PerceptionDLM introduces a multimodal diffusion language model that enables parallel region perception via structured attention masking and efficient prompting, achieving faster inference without sacrificing caption quality. Experiments show competitive performance with substantial speed improvements for multi-region perception tasks.

Why might DiffusionGemma be better at tool calls than its benchmark quality suggests

Reddit r/LocalLLaMA · 2026-06-16

The post analyzes how DiffusionGemma's bidirectional attention and parallel block generation could yield a higher rate of valid tool calls, because the model can revise tokens it has already generated, even though its base quality is lower than Gemma 4's.

MARS: Margin-Adversarial Risk-controlled Stopping for Parallel LLM Test-time Scaling

arXiv cs.AI · 2026-06-12

This paper introduces MARS, a stopping rule for parallel LLM test-time scaling that probes partial traces to stop early without sacrificing accuracy, saving 25–47% of tokens across reasoning models on competition math benchmarks.

Supportive Token Revealing for Fast Diffusion Language Model Decoding

arXiv cs.CL · 2026-06-04

This paper proposes AXON, a training-free module that improves the quality-latency trade-off of discrete diffusion language model decoding by intelligently selecting 'anchor' tokens to reveal first, using attention, uncertainty, and confidence signals to support subsequent denoising steps. Experiments on reasoning and code-generation benchmarks show AXON reduces function evaluations while maintaining or improving accuracy.

@VincentLogic: NVIDIA's newly open-sourced LocateAnything model is really impressive. The previous visual grounding models generated coordinates digit by digit (like squeezing toothpaste), slow and unstable. This new model uses "parallel bounding box decoding" to predict complete coordinates in one step, much faster and more accurate...

X AI KOLs Timeline · 2026-06-03

NVIDIA has open-sourced LocateAnything, a model that uses parallel bounding box decoding to predict complete coordinates in one step, which the post describes as faster and more accurate than digit-by-digit generation. The model has only 3B parameters and can run on consumer-grade GPUs, and it supports video object localization, UI recognition, OCR, and other tasks.

Fast-dLLM++: Fréchet Profile Decoding for Faster Diffusion LLM Inference

arXiv cs.CL · 2026-06-03

Fast-dLLM++ introduces Fréchet profile decoding for diffusion LLMs, a training-free method that selects parallel commit sets based on heterogeneous confidence profiles, achieving up to 37% higher throughput at comparable accuracy on benchmarks with LLaDA-8B.

EPIC: Efficient and Parallel Inference under CFG Constraints for Diffusion Language Models

arXiv cs.CL · 2026-06-02

This paper presents EPIC, an efficient framework for context-free grammar constrained decoding in diffusion language models that reduces inference time by up to 67.5% while maintaining syntactic correctness.

Efficient Diffusion LLMs via Temporal-Spatial Parallel Decoding and Confidence Extrapolation

arXiv cs.CL · 2026-06-01

This paper introduces Temporal-Spatial Parallel Decoding (TSPD) and Confidence Extrapolation (CE) to accelerate inference in diffusion-based large language models by dynamically deciding when tokens have converged and forecasting logit trends, reducing unnecessary denoising steps while preserving output quality.

@ZhidingYu: Thank you NVIDIA! I will be presenting LocateAnything at #CVPR2026 at the NVIDIA Booth: June 5 4:20 - 4:40 pm MDT (Frid…

X AI KOLs Following · 2026-05-28

NVIDIA introduces LocateAnything, a unified generative grounding and detection framework that uses Parallel Box Decoding to improve decoding throughput and localization accuracy. This work will be presented at CVPR 2026.

LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

Hugging Face Daily Papers · 2026-05-26

LocateAnything proposes Parallel Box Decoding for unified visual grounding and object detection, decoding geometric elements as atomic units to improve throughput and localization accuracy, supported by a large-scale dataset of 138M samples.

Towards Speed-of-Light Text Generation with Nemotron-Labs Diffusion Language Models

Hugging Face Blog · 2026-05-23

NVIDIA introduces Nemotron-Labs Diffusion, a family of diffusion language models that generate text in parallel and iteratively refine it, offering faster generation and the ability to revise previous tokens.

@NVIDIAAI: Most language models only generate one token at a time. We just released Nemotron-Labs-Diffusion, a family of diffusion…

X AI KOLs Following · 2026-05-19

NVIDIA released Nemotron-Labs-Diffusion, a family of diffusion language models that generate multiple tokens in parallel, enabling faster inference and better GPU utilization, with sizes from 3B to 14B including vision-language variants.

Roll Out and Roll Back: Diffusion LLMs are Their Own Efficiency Teachers

arXiv cs.CL · 2026-05-19

This paper introduces WINO and WINO+, methods that enable revokable parallel decoding in diffusion LLMs and distill efficient denoising trajectories, significantly improving the quality-speed trade-off.

PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding

arXiv cs.CL · 2026-05-18

This paper introduces Parallel Speculative Decoding (PSD), a training-free framework that accelerates diffusion LLM inference by jointly improving spatial and temporal efficiency, achieving up to 5.5× tokens per forward pass with comparable quality to greedy decoding.

@DivyanshT91162: Autoregressive LLMs might already be getting replaced Someone built dLLM — an open-source library that can turn ANY aut…

X AI KOLs Timeline · 2026-05-16

dLLM is an open-source library that converts any autoregressive LLM into a diffusion LLM, enabling parallel decoding and faster text generation.

Orthrus-Qwen3-8B: up to 7.8× tokens/forward on Qwen3-8B, frozen backbone, provably identical output distribution

Reddit r/LocalLLaMA · 2026-05-15

The post introduces Orthrus, a method that injects a trainable diffusion attention module into a frozen autoregressive transformer to achieve up to 7.8× tokens per forward pass and ~6× wall-clock speedup on MATH-500, with an output distribution provably identical to that of the base Qwen3-8B model. The approach requires minimal additional parameters and training, and avoids the time-to-first-token (TTFT) penalty of external drafters.

LEAP: Unlocking dLLM Parallelism via Lookahead Early-Convergence Token Detection

arXiv cs.LG · 2026-05-13

This paper introduces LEAP, a training-free method to accelerate inference in Diffusion Language Models (dLLMs) by detecting early-converging tokens, reducing denoising steps by 30% without losing accuracy.

@JulieKallini: Fast Byte Latent Transformer is accepted to ICML 2026! Byte-level LMs promise to free us from subword tokenizers, but d…

X AI KOLs Following · 2026-05-11

The Fast Byte Latent Transformer (BLT-D) has been accepted to ICML 2026, introducing a text diffusion method for parallel byte-level decoding to overcome the speed limitations of traditional byte-level language models.
