UnityShots: Memory-Driven Multi-Shot Audio-Video Generation with Boundary-Aware Gating

Hugging Face Daily Papers 06/19/26, 12:00 AM Papers

multi-shot audio-video generation memory-driven coherence open-source ltx-2.3

Summary

UnityShots is a memory-driven multi-shot audio-video generation system that maintains consistent subject appearance and audio across video cuts using fixed-size long-term and short-term memory slots with boundary-conditioned gates and discrete cut-type priors. It leads open-source baselines on every cross-shot coherence metric and matches the strongest closed-source system on the multi-shot axes.

Generating a coherent multi-shot video requires structured cross-shot memory. Subject appearance, scene context, and speaker identity must persist across cuts. Existing approaches either train end-to-end over fixed-length sequences and cannot scale, generate shot-by-shot with memory banks that grow linearly, or orchestrate pretrained generators under an LLM planner without a multi-shot-aware backbone. We present UnityShots, a memory-driven multi-shot audio-video generation system built on LTX-2.3, trained on annotated cinematic and music-video shots. The video stream maintains two fixed-size slots, a long-term memory (LTM) slot anchored to the opening shot and a short-term memory (STM) slot holding the immediately preceding tail, both updated at every cut by a boundary-conditioned gate that fuses visual cut probability and beat-tracker signals. The audio stream injects a reference speaker token at every shot to preserve vocal timbre without a sliding audio bank. A discrete cut-type prior, learned through AdaLN, becomes an inference-time control knob over transition strength. We release a benchmark of 200 multi-cultural multi-shot sequences spanning six ethnic regions and ten or more languages, with per-shot reference identities, reference audio, and per-boundary transition labels. Evaluated across I2V, T2V, and R2V conditioning modes, UnityShots leads open-source baselines on every cross-shot coherence metric and matches the strongest closed-source system on the multi-shot axes.
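The two-slot memory described above can be pictured with a small sketch. The abstract does not give the gate's functional form or the slot update rule, so everything here is an assumption for illustration: the logistic fusion in `boundary_gate`, its weights, the linear `blend`, and the `ltm_rate` parameter are hypothetical names and choices, not the authors' method. The only property the sketch is meant to show is that memory stays the same size no matter how many cuts are processed.

```python
import math
from dataclasses import dataclass
from typing import List

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def boundary_gate(cut_prob: float, beat_strength: float,
                  w_cut: float = 4.0, w_beat: float = 2.0,
                  bias: float = -3.0) -> float:
    """Fuse a visual cut probability and a beat-tracker signal into a gate in (0, 1).

    ASSUMPTION: the logistic form and the weights are illustrative only.
    """
    return sigmoid(w_cut * cut_prob + w_beat * beat_strength + bias)


def blend(old: Vector, new: Vector, gate: float) -> Vector:
    """Convex combination: gate=0 keeps the old slot, gate=1 overwrites it."""
    return [(1.0 - gate) * o + gate * n for o, n in zip(old, new)]


@dataclass
class ShotMemory:
    ltm: Vector  # long-term slot, initialised from the opening shot
    stm: Vector  # short-term slot, tracks the tail of the preceding shot

    def update_at_cut(self, shot_tail: Vector, cut_prob: float,
                      beat_strength: float, ltm_rate: float = 0.1) -> float:
        """Update both slots at a shot boundary and return the gate value.

        ASSUMPTION: the LTM slot moves at a fraction (ltm_rate) of the STM
        rate so that it stays close to the opening shot.
        """
        g = boundary_gate(cut_prob, beat_strength)
        self.stm = blend(self.stm, shot_tail, g)
        self.ltm = blend(self.ltm, shot_tail, ltm_rate * g)
        return g


if __name__ == "__main__":
    opening = [1.0, 0.0, 0.0]
    memory = ShotMemory(ltm=list(opening), stm=list(opening))
    # Hypothetical per-shot tail features, cut probabilities, and beat strengths.
    shots = [([0.0, 1.0, 0.0], 0.9, 0.8), ([0.0, 0.0, 1.0], 0.2, 0.1)]
    for tail, cut_prob, beat in shots:
        gate = memory.update_at_cut(tail, cut_prob, beat)
        print(f"gate={gate:.3f} slots={len(memory.ltm)}+{len(memory.stm)}")
```

Because each update overwrites the two slots in place, the memory footprint is constant in the number of shots, in contrast to the linearly growing memory banks the abstract attributes to shot-by-shot approaches.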

Original Article

Cached at: 06/25/26, 09:11 AM

Source: https://huggingface.co/papers/2606.21661

Datasets citing this paper: 1

#### KlingTeam/UnityShotsBench


Similar Articles

MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation

MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation

Memento: Reconstruct to Remember for Consistent Long Video Generation

Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation

UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors
