Long Video Generation (4 minute read)

TLDR AI Papers

Summary

The article introduces A²RD, a novel architecture for generating consistent long videos using agentic autoregressive diffusion. It proposes a Retrieve–Synthesize–Refine–Update cycle and a new benchmark, LVBench-C, to address semantic drift in long-horizon video synthesis.

A²RD introduced an agentic autoregressive diffusion framework for generating long coherent videos through iterative retrieval, synthesis, refinement, and memory updates.
Original Article Export to Word Export to PDF
View Cached Full Text

Cached at: 05/13/26, 12:22 AM

# A²RD: Agentic Autoregressive Diffusion for Long Video Consistency Source: [https://dxlong2000.github.io/AARD/](https://dxlong2000.github.io/AARD/) 1Google Cloud AI Research2National University of Singapore ## Abstract Synthesizing consistent and coherent long video remains a fundamental challenge\. Existing methods suffer from semantic drift and narrative collapse over long horizons\. We present**A²RD**\(/ɑːrd/\), an**A**gentic**A**uto\-**R**egressive**D**iffusion architecture that decouples creative synthesis from consistency enforcement\. A²RD formulates long video synthesis as a closed\-loop process that synthesizes and self\-improves video segment\-by\-segment through a Retrieve–Synthesize–Refine–Update cycle\. It comprises three core components:*\(1\) Multimodal Video Memory*that tracks video progression across modalities;*\(2\) Adaptive Segment Generation*that switches among generation modes for natural progression and visual consistency; and*\(3\) Hierarchical Test\-Time Self\-Improvement*that self\-improves each segment at frame and video levels to prevent error propagation\. We further introduce**LVBench\-C**, a challenging benchmark with non\-linear entity and environment transitions to stress\-test long\-horizon consistency\. Across public and LVBench\-C benchmarks spanning one\- to ten\-minute videos, A²RD outperforms state\-of\-the\-art baselines by up to 30% in consistency and 20% in narrative coherence\. ## Terminology TerminologyDescription**Shot**A continuous sequence of frames captured from a single camera angle without cuts\.**Scene**A narrative unit representing continuous action within a single physical environment or location\.**Segment \(Clip\)**The fundamental generation unit in A²RD, which is flexible and can span one or multiple shots or scenes\.**Segment Context \(𝑆ᵢ\)**The textual description dictating the narrative, actions, and settings for the 𝑖\-th segment\.**Storyline \(𝒮\)**The complete sequential collection of segment contexts \{𝑆₁, …, 𝑆ₙ\} defining the full video narrative\.**Extrapolation**A generation mode that synthesizes a video segment moving forward from only a beginning frame\.**Interpolation**A generation mode that synthesizes a video segment to seamlessly connect a fixed beginning and ending frame\. ## Method Overview A²RD enables video diffusion models to synthesize and self\-improve long videos autoregressively, enforcing temporal consistency and narrative coherence\. A²RD is training\-free and built upon three pillars: - **Multimodal Video Memory**: Existing methods store only visual references, losing narrative context over long horizons\. A²RD stores structured contexts from synthesized segments, disentangling each segment into three modalities:*Textual States*\(entity identities, attribute changes, motions, spatial relations, camera trajectories\),*Frames*\(global references and boundary keyframes\), and*Videos*\(full segments for motion continuity\)\. Online Retrieve and Update operations are enabled for synthesis\. - **Adaptive Segment Generation**: Prior studies adopt either extrapolation or interpolation as a fixed generation mode\. Extrapolation enables natural progression but risks semantic drift; interpolation enforces stronger consistency but risks unnatural video progression when end frames are poorly planned\. A²RD adaptively selects the mode per segment to enable both natural video progression and strong consistency enforcement\. - **Hierarchical Test\-Time Self\-Improvement \(HITS\)**: A single inconsistent frame can cascade artifacts across the entire horizon\. Existing video refinement methods operate only on short clips\. A²RD introduces HITS to self\-improve long videos hierarchically — first boundary frames, then full segments — focusing on intra\- and inter\-segment coherence, and video quality to combat errors propagate uncorrected\. The workflow proceeds in two stages: - **Memory Initialization**: The agent reasons over the narrative to identify entities and environments, constructs a dependency graph, and synthesizes global reference frames as a form of long\-term memory\. - **Autoregressive Segment Synthesis & Self\-Improvement**: For each segment, the agent retrieves context from memory, selects the generation mode, synthesizes boundary frames and video, applies HITS, and updates memory before advancing\. ![A²RD Overview](https://dxlong2000.github.io/AARD/imgs/AARD.png) ## Benchmark: LVBench\-C We introduce**LVBench\-C**\(Long Video Bench\-Challenge\), a challenging benchmark designed to stress\-test temporal consistency under complex scenarios where**entities and environments appear, disappear, and reappear**across long horizons with optional state changes\. LVBench\-C features multi\-shot stories at 3\-minute, 5\-minute, and 10\-minute scales with rich non\-linear entity and environment transitions\. ![LVBench-C Overview](https://dxlong2000.github.io/AARD/imgs/Dataset.png) ## SOTA Segment\-Based Long Video Synthesis Baselines **Single\-scene \(VBench\-Long\):**A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage\. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse\. She wears sunglasses and red lipstick\. She walks confidently and casually\. The street is damp and reflective, creating a mirror effect of the colorful lights\. Many pedestrians walk about\. **Multi\-scene \(LVBench\-C, 3 minutes, The Scuba Diver's Reef Exploration\):**Prompt below\. ## A²RDSingle\-Scene/Multi\-SceneLong Video Gallery ## A²RD Multi\-Scene Ultra\-Long Video Gallery \(a\) 3\-minute: The Master Potter's Creation Scene 1: In a quiet morning living room, a man with a grey ponytail puts on a clean navy blue apron\. Scene 2: He walks into his kitchen and packs a small wooden crate with various carving tools\. Scene 3: He exits his house and walks down a cobblestone alleyway toward his art studio\. Scene 4: Inside the bright studio, he approaches a large bag of wet, grey clay and cuts a large chunk with a wire\. Scene 5: He carries the heavy clay to a pottery wheel and slams it down onto the center of the bat\. Scene 6: The man sits at the wheel and begins centering the clay, his hands quickly becoming coated in thick, wet slip\. Scene 7: As the wheel spins, he pulls the clay upward, forming a tall, elegant vase shape\. Scene 8: He picks up a wet sponge and smooths the exterior of the vase, grey water dripping onto his apron\. Scene 9: He uses a metal rib tool to shave the sides, creating a pile of clay shavings around the base\. Scene 10: The man stops the wheel and uses a thin wire to carefully slice the vase off the spinning head\. Scene 11: He carries the wet vase into a drying room filled with wooden shelves and sets it down gently\. Scene 12: He walks to a workbench and picks up a leather\-hard bowl from the previous day to begin carving\. Scene 13: He uses a fine needle tool to etch intricate patterns into the bowl, clay dust settling on his arms\. Scene 14: The man carries the carved bowl into a kiln room and carefully places it inside the large industrial kiln\. Scene 15: He adjusts the digital settings on the kiln and presses the start button to begin the firing process\. Scene 16: He walks to a glazing station and stirs a bucket of deep blue glaze with a wooden stick\. Scene 17: He dips a finished, fired plate into the blue liquid, his fingers getting stained with the pigment\. Scene 18: He sets the glazed plate on a rack to dry, looking at the transformation of the surface\. Scene 19: The man walks to a large utility sink and begins scrubbing the thick clay from his hands and forearms\. Scene 20: He removes the navy blue apron, which is now heavily stained with grey clay and blue glaze spots\. Scene 21: He hangs the apron on a wall hook and picks up his wooden crate of tools\. Scene 22: He walks back through the cobblestone alley as the evening streetlamps flicker on\. Scene 23: Entering his home, he places the tool crate on the table and sighs with satisfaction\. Scene 24: He stands in his living room stretching leisurely, a look of deep satisfaction on his face\. \(a\) 3\-minute: The Scuba Diver's Reef Exploration Scene 1: A diver stands on the deck of a boat in the ocean\. Scene 2: The diver is wearing a black neoprene wetsuit\. Scene 3: The diver puts on a heavy air tank and harness\. Scene 4: The diver fastens a weight belt around their waist\. Scene 5: The diver sits on the edge of the boat deck\. Scene 6: The diver pulls a rubber mask over their eyes\. Scene 7: The diver puts a regulator mouthpiece in their mouth\. Scene 8: The diver falls backward into the blue water\. Scene 9: The diver sinks beneath the surface of the ocean\. Scene 10: Bubbles rise from the diver's regulator as they breathe\. Scene 11: The diver swims down toward a colorful coral reef\. Scene 12: The diver sees a school of bright tropical fish\. Scene 13: The diver hovers near a large sea turtle\. Scene 14: The diver checks the air pressure gauge on their tank\. Scene 15: The diver begins to swim slowly back to the surface\. Scene 16: The diver breaks the surface of the water\. Scene 17: The diver swims to the ladder on the side of the boat\. Scene 18: The diver climbs up the ladder onto the deck\. Scene 19: The diver removes the rubber mask from their face\. Scene 20: The diver takes the regulator out of their mouth\. Scene 21: The diver removes the heavy air tank and harness\. Scene 22: The diver enters the boat's cabin and changes into dry clothes\. Scene 23: The diver hangs the wet wetsuit on a drying rack\. Scene 24: The boat begins to drive back toward the harbor\. \(b\) 5\-minute: The Stage Fright \(Clara\) Scene 1: Clara, wearing an oversized wool sweater and glasses, sits at a piano in a dusty attic\. Scene 2: Her hair is messy and tied back with a simple rubber band as she hums a melody\. Scene 3: She stops to scribble notes onto a piece of crumpled sheet music with a pencil\. Scene 4: Clara wipes a layer of dust off the piano keys, her fingers trembling slightly\. Scene 5: The grand theater lobby is filled with socialites in tuxedos and evening gowns\. Scene 6: Ushers in gold\-trimmed uniforms hand out glossy programs to the arriving guests\. Scene 7: A large poster in the lobby features a silhouette of a pianist with the name 'CLARA' in bold\. Scene 8: Stagehands move a massive black grand piano into the center of the stage\. Scene 9: The conductor of the orchestra adjusts his baton, looking at his pocket watch\. Scene 10: The audience begins to file into the rows of red velvet seats, whispering in anticipation\. Scene 11\-40: \[Full 40\-scene narrative continues\.\.\.\] \(c\) 10\-minute: The Great Museum Heist Scene 1: Victor and Saffron sit in a dim basement, wearing casual hoodies and jeans\. Scene 2: They study a holographic blueprint of the Royal Museum, glowing blue on the table\. Scene 3: Saffron points to the laser grid in the North Gallery, her eyes narrow and focused\. Scene 4: Victor checks the internal mechanism of a miniature glass\-cutting device\. Scene 5: They clink two mugs of cold coffee together, finalizing their silent pact\. Scene 6: The Royal Museum stands majestic under the moonlight, guarded by tall stone lions\. Scene 7: A security guard walks his patrol, the beam of his flashlight cutting through the dark\. Scene 8: The museum's grand clock strikes midnight, the sound echoing through the empty streets\. Scene 9\-80: \[Full 80\-scene narrative continues\.\.\.\] ## Citation \(BibTeX\)

Similar Articles

Video generation models as world simulators

OpenAI Blog

OpenAI's technical report on Sora describes a video generation model that unifies diverse visual data through visual patches, enabling large-scale training of generative models capable of producing high-definition videos up to one minute long across variable durations, aspect ratios, and resolutions.