A^2RD: Agentic Autoregressive Diffusion for Long Video Consistency
Summary
A^2RD is an Agentic Autoregressive Diffusion architecture for long video synthesis. It improves consistency and narrative coherence through a closed-loop process that synthesizes and self-improves video segment by segment.
Cached at: 05/11/26, 02:42 AM
Source: https://huggingface.co/papers/2605.06924
Abstract
A^2RD, an Agentic Auto-Regressive Diffusion architecture, addresses long video synthesis challenges through a closed-loop process with memory tracking, adaptive generation, and hierarchical self-improvement mechanisms.
Synthesizing consistent and coherent long video remains a fundamental challenge. Existing methods suffer from semantic drift and narrative collapse over long horizons. We present A^2RD, an Agentic Auto-Regressive Diffusion architecture that decouples creative synthesis from consistency enforcement. A^2RD formulates long video synthesis as a closed-loop process that synthesizes and self-improves video segment-by-segment through a Retrieve–Synthesize–Refine–Update cycle. It comprises three core components: (i) Multimodal Video Memory that tracks video progression across modalities; (ii) Adaptive Segment Generation that switches among generation modes for natural progression and visual consistency; and (iii) Hierarchical Test-Time Self-Improvement that self-improves each segment at frame and video levels to prevent error propagation. We further introduce LVBench-C, a challenging benchmark with non-linear entity and environment transitions to stress-test long-horizon consistency. Across public and LVBench-C benchmarks spanning one- to ten-minute videos, A^2RD outperforms state-of-the-art baselines by up to 30% in consistency and 20% in narrative coherence. Human evaluations corroborate these gains while also highlighting notable improvements in motion and transition smoothness.
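The Retrieve–Synthesize–Refine–Update cycle described in the abstract can be illustrated with a toy control loop. This is only a sketch of the loop structure, not the authors' implementation: every name here (VideoMemory, Segment, synthesize, consistency_score, refine, generate_long_video) is hypothetical, the "diffusion model" and the consistency critic are placeholders, and the frame-level versus video-level hierarchy and the adaptive mode switching are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    index: int
    prompt: str
    context: list          # memory entries this segment was conditioned on
    refine_steps: int = 0  # number of refinement passes applied so far


@dataclass
class VideoMemory:
    """Hypothetical stand-in for a multimodal video memory."""
    entries: list = field(default_factory=list)

    def retrieve(self, k: int = 2) -> list:
        # Placeholder policy: return the k most recent entries.
        return self.entries[-k:]

    def update(self, segment: Segment) -> None:
        self.entries.append(f"segment {segment.index}: {segment.prompt}")


def synthesize(index: int, prompt: str, context: list) -> Segment:
    # Placeholder for a call to a video diffusion model.
    return Segment(index=index, prompt=prompt, context=list(context))


def consistency_score(segment: Segment) -> float:
    # Toy critic (assumption): the score rises by 0.25 per refinement pass.
    return min(1.0, 0.5 + 0.25 * segment.refine_steps)


def refine(segment: Segment, threshold: float = 0.9, max_steps: int = 3) -> Segment:
    # Keep refining until the critic is satisfied or the budget runs out.
    while consistency_score(segment) < threshold and segment.refine_steps < max_steps:
        segment.refine_steps += 1
    return segment


def generate_long_video(prompts: list):
    memory = VideoMemory()
    segments = []
    for i, prompt in enumerate(prompts):
        context = memory.retrieve()               # Retrieve
        segment = synthesize(i, prompt, context)  # Synthesize
        segment = refine(segment)                 # Refine
        memory.update(segment)                    # Update
        segments.append(segment)
    return segments, memory


if __name__ == "__main__":
    segs, mem = generate_long_video(["a boat leaves port", "a storm gathers", "the boat returns"])
    for s in segs:
        print(s.index, s.refine_steps, s.context)
```

The point of the structure is that each segment is refined before it is written to memory, so later segments are conditioned only on refined content. That is one way to read the abstract's claim about preventing error propagation.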
Get this paper in your agent:
hf papers read 2605.06924
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
Long Video Generation (4 minute read)
The article introduces A²RD, a novel architecture for generating consistent long videos using agentic autoregressive diffusion. It proposes a Retrieve–Synthesize–Refine–Update cycle and a new benchmark, LVBench-C, to address semantic drift in long-horizon video synthesis.
AAD-1: Asymmetric Adversarial Distillation for One-Step Autoregressive Video Generation
AAD-1 introduces asymmetric adversarial distillation with phased training to achieve one-step autoregressive video generation, outperforming prior methods on VBench.
Causal-rCM: A Unified Teacher-Forcing and Self-Forcing Open Recipe for Autoregressive Diffusion Distillation in Streaming Video Generation and Interactive World Models
This paper introduces Causal-rCM, a unified teacher-forcing and self-forcing framework for autoregressive diffusion distillation in streaming video generation and interactive world models, achieving state-of-the-art performance with fast convergence.
Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation
Causal Forcing++ presents a novel causal consistency distillation method for frame-wise autoregressive video generation, achieving state-of-the-art quality with reduced latency and training cost.
AdaState: Self-Evolving Anchors for Streaming Video Generation
This paper introduces AdaState, a method that replaces the static first-frame anchor in autoregressive video diffusion models with an adaptive state that evolves with the generated content, enabling richer motion and natural scene progression in streaming video generation.