Long Video Generation (4 minute read)

TLDR AI 05/12/26, 12:00 AM Papers

video-generation consistency diffusion-models long-video research benchmark

Summary

The article introduces A²RD, a novel architecture for generating consistent long videos using agentic autoregressive diffusion. It proposes a Retrieve–Synthesize–Refine–Update cycle and a new benchmark, LVBench-C, to address semantic drift in long-horizon video synthesis.

A²RD is an agentic autoregressive diffusion framework that generates long, coherent videos through iterative retrieval, synthesis, refinement, and memory updates.

Original Article


Cached at: 05/13/26, 12:22 AM

# A²RD: Agentic Autoregressive Diffusion for Long Video Consistency

Source: [https://dxlong2000.github.io/AARD/](https://dxlong2000.github.io/AARD/)

¹Google Cloud AI Research, ²National University of Singapore

## Abstract

Synthesizing consistent and coherent long video remains a fundamental challenge. Existing methods suffer from semantic drift and narrative collapse over long horizons. We present **A²RD** (/ɑːrd/), an **A**gentic **A**uto-**R**egressive **D**iffusion architecture that decouples creative synthesis from consistency enforcement. A²RD formulates long video synthesis as a closed-loop process that synthesizes and self-improves video segment-by-segment through a Retrieve–Synthesize–Refine–Update cycle. It comprises three core components: *(1) Multimodal Video Memory* that tracks video progression across modalities; *(2) Adaptive Segment Generation* that switches among generation modes for natural progression and visual consistency; and *(3) Hierarchical Test-Time Self-Improvement* that self-improves each segment at frame and video levels to prevent error propagation. We further introduce **LVBench-C**, a challenging benchmark with non-linear entity and environment transitions to stress-test long-horizon consistency. Across public and LVBench-C benchmarks spanning one- to ten-minute videos, A²RD outperforms state-of-the-art baselines by up to 30% in consistency and 20% in narrative coherence.

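The closed-loop Retrieve–Synthesize–Refine–Update cycle named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: `VideoMemory`, `generate_long_video`, `toy_synthesize`, and `toy_refine` are hypothetical names, the memory holds plain strings rather than multimodal data, and the "most recent k entries" retrieval policy is an assumption made only to keep the example runnable.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class VideoMemory:
    """Toy stand-in for a video memory; stores finished segments as strings."""
    entries: List[str] = field(default_factory=list)

    def retrieve(self, context: str, k: int = 2) -> List[str]:
        # Assumed policy: return the k most recently stored segments.
        return self.entries[-k:]

    def update(self, segment: str) -> None:
        self.entries.append(segment)


def generate_long_video(
    storyline: List[str],
    synthesize: Callable[[str, List[str]], str],
    refine: Callable[[str, List[str]], str],
    memory: Optional[VideoMemory] = None,
) -> List[str]:
    """Run one Retrieve-Synthesize-Refine-Update pass per segment context."""
    memory = memory if memory is not None else VideoMemory()
    segments: List[str] = []
    for context in storyline:
        retrieved = memory.retrieve(context)    # Retrieve
        draft = synthesize(context, retrieved)  # Synthesize
        final = refine(draft, retrieved)        # Refine
        memory.update(final)                    # Update
        segments.append(final)
    return segments


# Placeholder callables; a real system would invoke generation models here.
def toy_synthesize(context: str, retrieved: List[str]) -> str:
    return f"{context} | refs={len(retrieved)}"


def toy_refine(draft: str, retrieved: List[str]) -> str:
    return draft + " | refined"


video = generate_long_video(["S1", "S2", "S3"], toy_synthesize, toy_refine)
print(video)
```

The point of the structure is that each segment is refined before it is written to memory, so later segments only ever condition on corrected outputs.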
## Terminology

| Terminology | Description |
| --- | --- |
| **Shot** | A continuous sequence of frames captured from a single camera angle without cuts. |
| **Scene** | A narrative unit representing continuous action within a single physical environment or location. |
| **Segment (Clip)** | The fundamental generation unit in A²RD, which is flexible and can span one or multiple shots or scenes. |
| **Segment Context (𝑆ᵢ)** | The textual description dictating the narrative, actions, and settings for the 𝑖-th segment. |
| **Storyline (𝒮)** | The complete sequential collection of segment contexts {𝑆₁, …, 𝑆ₙ} defining the full video narrative. |
| **Extrapolation** | A generation mode that synthesizes a video segment moving forward from only a beginning frame. |
| **Interpolation** | A generation mode that synthesizes a video segment to seamlessly connect a fixed beginning and ending frame. |

## Method Overview

A²RD enables video diffusion models to synthesize and self-improve long videos autoregressively, enforcing temporal consistency and narrative coherence. A²RD is training-free and built upon three pillars:

- **Multimodal Video Memory**: Existing methods store only visual references, losing narrative context over long horizons. A²RD stores structured contexts from synthesized segments, disentangling each segment into three modalities: *Textual States* (entity identities, attribute changes, motions, spatial relations, camera trajectories), *Frames* (global references and boundary keyframes), and *Videos* (full segments for motion continuity). Online Retrieve and Update operations are enabled for synthesis.
- **Adaptive Segment Generation**: Prior studies adopt either extrapolation or interpolation as a fixed generation mode. Extrapolation enables natural progression but risks semantic drift; interpolation enforces stronger consistency but risks unnatural video progression when end frames are poorly planned.
  A²RD adaptively selects the mode per segment to enable both natural video progression and strong consistency enforcement.
- **Hierarchical Test-Time Self-Improvement (HITS)**: A single inconsistent frame can cascade artifacts across the entire horizon. Existing video refinement methods operate only on short clips. A²RD introduces HITS to self-improve long videos hierarchically (first boundary frames, then full segments), focusing on intra- and inter-segment coherence and on video quality, so that errors do not propagate uncorrected.

The workflow proceeds in two stages:

- **Memory Initialization**: The agent reasons over the narrative to identify entities and environments, constructs a dependency graph, and synthesizes global reference frames as a form of long-term memory.
- **Autoregressive Segment Synthesis & Self-Improvement**: For each segment, the agent retrieves context from memory, selects the generation mode, synthesizes boundary frames and video, applies HITS, and updates memory before advancing.

![A²RD Overview](https://dxlong2000.github.io/AARD/imgs/AARD.png)

## Benchmark: LVBench-C

We introduce **LVBench-C** (Long Video Bench-Challenge), a challenging benchmark designed to stress-test temporal consistency under complex scenarios where **entities and environments appear, disappear, and reappear** across long horizons with optional state changes. LVBench-C features multi-shot stories at 3-minute, 5-minute, and 10-minute scales with rich non-linear entity and environment transitions.

![LVBench-C Overview](https://dxlong2000.github.io/AARD/imgs/Dataset.png)

## SOTA Segment-Based Long Video Synthesis Baselines

**Single-scene (VBench-Long):** A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually.
The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.

**Multi-scene (LVBench-C, 3 minutes, The Scuba Diver's Reef Exploration):** Prompt below.

## A²RD Single-Scene/Multi-Scene Long Video Gallery

## A²RD Multi-Scene Ultra-Long Video Gallery

**(a) 3-minute: The Master Potter's Creation**

- Scene 1: In a quiet morning living room, a man with a grey ponytail puts on a clean navy blue apron.
- Scene 2: He walks into his kitchen and packs a small wooden crate with various carving tools.
- Scene 3: He exits his house and walks down a cobblestone alleyway toward his art studio.
- Scene 4: Inside the bright studio, he approaches a large bag of wet, grey clay and cuts a large chunk with a wire.
- Scene 5: He carries the heavy clay to a pottery wheel and slams it down onto the center of the bat.
- Scene 6: The man sits at the wheel and begins centering the clay, his hands quickly becoming coated in thick, wet slip.
- Scene 7: As the wheel spins, he pulls the clay upward, forming a tall, elegant vase shape.
- Scene 8: He picks up a wet sponge and smooths the exterior of the vase, grey water dripping onto his apron.
- Scene 9: He uses a metal rib tool to shave the sides, creating a pile of clay shavings around the base.
- Scene 10: The man stops the wheel and uses a thin wire to carefully slice the vase off the spinning head.
- Scene 11: He carries the wet vase into a drying room filled with wooden shelves and sets it down gently.
- Scene 12: He walks to a workbench and picks up a leather-hard bowl from the previous day to begin carving.
- Scene 13: He uses a fine needle tool to etch intricate patterns into the bowl, clay dust settling on his arms.
- Scene 14: The man carries the carved bowl into a kiln room and carefully places it inside the large industrial kiln.
- Scene 15: He adjusts the digital settings on the kiln and presses the start button to begin the firing process.
- Scene 16: He walks to a glazing station and stirs a bucket of deep blue glaze with a wooden stick.
- Scene 17: He dips a finished, fired plate into the blue liquid, his fingers getting stained with the pigment.
- Scene 18: He sets the glazed plate on a rack to dry, looking at the transformation of the surface.
- Scene 19: The man walks to a large utility sink and begins scrubbing the thick clay from his hands and forearms.
- Scene 20: He removes the navy blue apron, which is now heavily stained with grey clay and blue glaze spots.
- Scene 21: He hangs the apron on a wall hook and picks up his wooden crate of tools.
- Scene 22: He walks back through the cobblestone alley as the evening streetlamps flicker on.
- Scene 23: Entering his home, he places the tool crate on the table and sighs with satisfaction.
- Scene 24: He stands in his living room stretching leisurely, a look of deep satisfaction on his face.

**(a) 3-minute: The Scuba Diver's Reef Exploration**

- Scene 1: A diver stands on the deck of a boat in the ocean.
- Scene 2: The diver is wearing a black neoprene wetsuit.
- Scene 3: The diver puts on a heavy air tank and harness.
- Scene 4: The diver fastens a weight belt around their waist.
- Scene 5: The diver sits on the edge of the boat deck.
- Scene 6: The diver pulls a rubber mask over their eyes.
- Scene 7: The diver puts a regulator mouthpiece in their mouth.
- Scene 8: The diver falls backward into the blue water.
- Scene 9: The diver sinks beneath the surface of the ocean.
- Scene 10: Bubbles rise from the diver's regulator as they breathe.
- Scene 11: The diver swims down toward a colorful coral reef.
- Scene 12: The diver sees a school of bright tropical fish.
- Scene 13: The diver hovers near a large sea turtle.
- Scene 14: The diver checks the air pressure gauge on their tank.
- Scene 15: The diver begins to swim slowly back to the surface.
- Scene 16: The diver breaks the surface of the water.
- Scene 17: The diver swims to the ladder on the side of the boat.
- Scene 18: The diver climbs up the ladder onto the deck.
- Scene 19: The diver removes the rubber mask from their face.
- Scene 20: The diver takes the regulator out of their mouth.
- Scene 21: The diver removes the heavy air tank and harness.
- Scene 22: The diver enters the boat's cabin and changes into dry clothes.
- Scene 23: The diver hangs the wet wetsuit on a drying rack.
- Scene 24: The boat begins to drive back toward the harbor.

**(b) 5-minute: The Stage Fright (Clara)**

- Scene 1: Clara, wearing an oversized wool sweater and glasses, sits at a piano in a dusty attic.
- Scene 2: Her hair is messy and tied back with a simple rubber band as she hums a melody.
- Scene 3: She stops to scribble notes onto a piece of crumpled sheet music with a pencil.
- Scene 4: Clara wipes a layer of dust off the piano keys, her fingers trembling slightly.
- Scene 5: The grand theater lobby is filled with socialites in tuxedos and evening gowns.
- Scene 6: Ushers in gold-trimmed uniforms hand out glossy programs to the arriving guests.
- Scene 7: A large poster in the lobby features a silhouette of a pianist with the name 'CLARA' in bold.
- Scene 8: Stagehands move a massive black grand piano into the center of the stage.
- Scene 9: The conductor of the orchestra adjusts his baton, looking at his pocket watch.
- Scene 10: The audience begins to file into the rows of red velvet seats, whispering in anticipation.
- Scene 11-40: [Full 40-scene narrative continues...]

**(c) 10-minute: The Great Museum Heist**

- Scene 1: Victor and Saffron sit in a dim basement, wearing casual hoodies and jeans.
- Scene 2: They study a holographic blueprint of the Royal Museum, glowing blue on the table.
- Scene 3: Saffron points to the laser grid in the North Gallery, her eyes narrow and focused.
- Scene 4: Victor checks the internal mechanism of a miniature glass-cutting device.
- Scene 5: They clink two mugs of cold coffee together, finalizing their silent pact.
- Scene 6: The Royal Museum stands majestic under the moonlight, guarded by tall stone lions.
- Scene 7: A security guard walks his patrol, the beam of his flashlight cutting through the dark.
- Scene 8: The museum's grand clock strikes midnight, the sound echoing through the empty streets.
- Scene 9-80: [Full 80-scene narrative continues...]

## Citation (BibTeX)


Similar Articles

A^2RD: Agentic Autoregressive Diffusion for Long Video Consistency

LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation

AAD-1: Asymmetric Adversarial Distillation for One-Step Autoregressive Video Generation

Rethinking RAG in Long Videos: What to Retrieve and How to Use It?

@yukangchen_: We released a blog on "Why Video Gen Is an Infra Problem". https://research.nvidia.com/labs/eai/blogs/video-gen-is-an-i…
