Introduces Speculative Refinement (SpecRef), a training-free hybrid decoding strategy that warm-starts a masked diffusion language model from an autoregressive draft using entropy-guided selective masking. An evaluation across six benchmarks shows that code benchmarks conflate structural discovery with logical correctness, identifies a refinement tension phenomenon, and shows that different evaluation protocols can produce different model rankings.
MultiHashFormer is a hash-based generative language model that represents each token as a unique hash signature, enabling parameter-efficient autoregression. It outperforms standard Transformer LMs at 100M, 1B, and 3B scales and supports multilingual vocabulary expansion without increasing parameters.
Nemotron-TwoTower is a diffusion language model that decouples context representation from denoising by pairing a frozen autoregressive tower with a trainable diffusion denoiser, achieving 98.7% of baseline quality at 2.42x throughput.
Parallel Rollout Approximation (PRA) improves pixel-space autoregressive image generation by using low-dimensional intermediate states and parallel training, achieving new state-of-the-art results on ImageNet-1K generation.
This paper investigates training-time data augmentation techniques to mitigate overfitting in autoregressive language model pretraining under data-constrained, compute-abundant regimes, finding that combining token-level noise, sequence permutations, and target offset prediction improves validation loss.
This paper proposes E³RL, a reinforcement learning method that uses dynamic epistemic entropy thresholds to enable LLMs to excise local logical defects during generation, overcoming the autoregressive curse in long-horizon reasoning and achieving state-of-the-art results on mathematical reasoning benchmarks like AIME.
This paper presents a discrete autoregressive transformer that generates planar mechanisms from target coupler curves, using variational autoencoder latents and tokenized joint coordinates to achieve diverse, accurate designs across multiple topologies.
MaineCoon is a 22B-parameter real-time audio-visual autoregressive model for social world modeling, capable of streaming generation at up to 47.5 FPS on a single GPU, introducing novel training techniques and an agentic inference framework.
UniAR presents a unified autoregressive framework that uses a single discrete visual tokenizer to bridge visual understanding and generation, achieving state-of-the-art results in image generation and editing.
LOGOS is a scientific generative language model that encodes diverse scientific objects and spatial interactions as token sequences, enabling a unified autoregressive framework for tasks across natural sciences. Models at 1B, 3B, and 8B parameters show consistent performance scaling and are released to facilitate research.
FadeMem introduces a distance-aware key-value memory consolidation mechanism that organizes historical video data into a temporal hierarchy, improving long-video generation under fixed cache constraints.
This paper studies data-constrained language model pretraining, proposing masked-input regularization (MIR) to improve validation loss and downstream performance, and SoftQ, a scaling law that better captures model-data interaction under repeated data.
Stream3D-VLM is an online 3D vision-language model that enables real-time spatial understanding from streaming video by incrementally integrating geometry priors and using geometry-adaptive voxel compression, outperforming existing models on 3D spatial understanding tasks.
StreamForce is a causal, unified video generation model that provides real-time, physically grounded responses to time-varying forces through a distillation pipeline and autoregressive architecture, achieving state-of-the-art performance in force adherence and motion realism.
dots.tts presents a 2B-parameter continuous autoregressive TTS model trained on multilingual data, achieving state-of-the-art performance on benchmarks like Seed-TTS-Eval with low-latency streaming via CFG-aware MeanFlow distillation. The model, code, and checkpoints are released under Apache 2.0.
Echo-Infinity introduces a learnable evolving memory mechanism for autoregressive video generation, enabling real-time infinite video generation with constant memory cost and state-of-the-art performance.
MeshWeaver presents an autoregressive mesh generation framework that directly predicts vertices using a multi-level sparse-voxel encoder, achieving state-of-the-art compression and geometric fidelity for high-poly meshes.
NEPA is a method for visual self-supervised learning and generative pretraining that autoregressively predicts the next embedding; it has been added to a benchmark for evaluation.
Steady-Forcing proposes a memory and training framework to balance spatial stability and motion continuity in long-horizon nature video generation, improving background consistency while sustaining fluid dynamics over multi-minute rollouts.
AAD-1 introduces asymmetric adversarial distillation with phased training to achieve one-step autoregressive video generation, outperforming prior methods on VBench.