autoregressive-decoding

SpecLA: Efficient Speculative Decoding for Linear-Attention Models

arXiv cs.CL · 12h ago

SpecLA proposes a speculative decoding runtime tailored for stateful linear-attention models, achieving up to 1.70x end-to-end speedup over autoregressive decoding on an NVIDIA H100 with a GDN-1.3B target.

KVpop -- Key-Value Cache Compression with Predictive Online Pruning

Hugging Face Daily Papers · 2026-07-06

KVpop introduces a learned KV cache eviction policy supervised by future-attention targets. On Qwen3 models it reaches high compression rates with little quality loss (e.g., 98% performance at 75% compression).

Parallelized Autoregressive Decoding for Omni-Modal Dense Video Captioning

Hugging Face Daily Papers · 2026-07-03

This paper introduces PadCaptioner, a 3B-parameter model for omni-modal dense video captioning that uses parallelized autoregressive decoding to achieve high efficiency and quality, outperforming 7B counterparts. A latent planning mechanism enables lossless parallel generation by exploiting the weak local dependencies among events.

From Architecture to Output: Structural Origins of Hallucination in Large Language Models and the Amplifying Role of Data

arXiv cs.AI · 2026-06-11

This paper analyzes hallucination in large language models as a structural consequence of three design decisions: self-attention's co-occurrence learning, the maximum likelihood estimation training objective, and autoregressive decoding's left-to-right commitment. It maps each mechanism to specific hallucination types and argues that dataset pathologies amplify these vulnerabilities but do not cause them.

Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction

Hugging Face Daily Papers · 2026-06-04

This paper introduces Future-L1, an interleaved latent visual reasoning framework that improves video event prediction by maintaining visual semantics in latent space. It achieves state-of-the-art results on FutureBench and TwiFF-Bench.

KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks

Hugging Face Daily Papers · 2026-06-02

KVarN is a calibration-free KV-cache quantizer that uses Hadamard rotation and dual-scaling variance normalization to reduce error accumulation during autoregressive decoding in large language models, achieving state-of-the-art results at 2-bit precision on reasoning benchmarks.

@NVIDIAAI: Most language models only generate one token at a time. We just released Nemotron-Labs-Diffusion, a family of diffusion…

X AI KOLs Following · 2026-05-19

NVIDIA released Nemotron-Labs-Diffusion, a family of diffusion language models that generate multiple tokens in parallel, enabling faster inference and better GPU utilization, with sizes from 3B to 14B including vision-language variants.

BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion

arXiv cs.CL · 2026-05-13

This paper introduces BitLM, a language model that uses bitwise continuous diffusion to generate multiple tokens in parallel, aiming to overcome the sequential bottleneck of traditional autoregressive generation while preserving causal structure.
