Memento: Reconstruct to Remember for Consistent Long Video Generation

Hugging Face Daily Papers 06/12/26, 12:00 AM Papers

long-video-generation subject-consistency memory-based-reconstruction dual-query-mechanism autoregressive-generation video-generation temporal-decomposition

Summary

Memento is a subject-reconstruction-guided framework that improves long-form video generation by preserving recurring subjects through memory-based reconstruction and dual-query mechanisms, achieving state-of-the-art performance in long-term subject consistency and cross-shot coherence.

Long-form video generation requires recurring subjects to remain consistent across various shots, viewpoints, motions, and scene transitions. Existing temporal decomposition methods improve scalability by generating videos shot by shot. However, they mainly focus on optimizing plausible next-shot continuations without verifying whether the historical memory preserves identity-critical subject evidence. Consequently, as generation proceeds, recurring subjects may be diluted, overwritten, or forgotten. In this paper, we propose Memento, a subject-reconstruction-guided framework that treats subject preservation as an explicit identity grounding problem, based on the premise that a memory bank faithfully preserving a subject should support reconstructing that subject from memory alone. Specifically, Memento jointly trains autoregressive next-shot generation with memory-based subject reconstruction, recovering target appearances using historical memory and global story captions. To disentangle long-range subject evidence from short-range cues, Memento introduces a dual-query memory mechanism, where one query retrieves identity-relevant memory and the other selects short-context keyframes for coherent continuation. Additionally, a subject-aware cinematic data pipeline provides precise reconstruction supervision via consistent, pronoun-free subject descriptions. Experiments demonstrate that Memento achieves state-of-the-art performance in long-term subject consistency, cross-shot coherence, and visual quality.
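The abstract describes the dual-query memory mechanism and the joint training objective only at a high level, so the following is a minimal sketch of the idea and not the authors' implementation. It assumes the memory bank is an array of frame embeddings ordered oldest to newest, that retrieval uses cosine similarity with a top-k cutoff, and that the two training losses are combined with a scalar weight. The function names (`dual_query_select`, `joint_loss`) and the parameters (`top_k`, `recent_window`, `rec_weight`) are hypothetical and do not come from the paper.

```python
import numpy as np


def cosine_scores(query, keys):
    """Cosine similarity between one query vector and every row of `keys`."""
    q = query / np.linalg.norm(query)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    return k @ q


def dual_query_select(memory, identity_query, context_query,
                      top_k=2, recent_window=3):
    """Illustrative dual-query retrieval (assumed design, not from the paper).

    memory: (N, D) array of frame embeddings, ordered oldest to newest.
    Returns two sorted index arrays: identity-relevant entries drawn from
    the whole bank, and keyframes drawn only from the most recent entries.
    """
    n = memory.shape[0]

    # Long-range branch: the identity query searches the entire memory bank.
    id_scores = cosine_scores(identity_query, memory)
    identity_idx = np.argsort(-id_scores)[:top_k]

    # Short-range branch: the context query only sees the recent window.
    start = max(0, n - recent_window)
    ctx_scores = cosine_scores(context_query, memory[start:])
    context_idx = start + np.argsort(-ctx_scores)[:top_k]

    return np.sort(identity_idx), np.sort(context_idx)


def joint_loss(generation_loss, reconstruction_loss, rec_weight=1.0):
    """Weighted sum of the two objectives; `rec_weight` is an assumption."""
    return generation_loss + rec_weight * reconstruction_loss
```

Keeping the two queries separate means an old but identity-critical frame can still be retrieved even when it falls outside the recent window that drives shot-to-shot continuity.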

Original Article


Cached at: 06/16/26, 11:31 AM


Source: https://huggingface.co/papers/2606.14667

Abstract

Memento is a subject-reconstruction-guided framework that improves long-form video generation by preserving recurring subjects through memory-based reconstruction and dual-query mechanisms.

Long-form video generation requires recurring subjects to remain consistent across various shots, viewpoints, motions, and scene transitions. Existing temporal decomposition methods improve scalability by generating videos shot by shot. However, they mainly focus on optimizing plausible next-shot continuations without verifying whether the historical memory preserves identity-critical subject evidence. Consequently, as generation proceeds, recurring subjects may be diluted, overwritten, or forgotten. In this paper, we propose Memento, a subject-reconstruction-guided framework that treats subject preservation as an explicit identity grounding problem, based on the premise that a memory bank faithfully preserving a subject should support reconstructing that subject from memory alone. Specifically, Memento jointly trains autoregressive next-shot generation with memory-based subject reconstruction, recovering target appearances using historical memory and global story captions. To disentangle long-range subject evidence from short-range cues, Memento introduces a dual-query memory mechanism, where one query retrieves identity-relevant memory and the other selects short-context keyframes for coherent continuation. Additionally, a subject-aware cinematic data pipeline provides precise reconstruction supervision via consistent, pronoun-free subject descriptions. Experiments demonstrate that Memento achieves state-of-the-art performance in long-term subject consistency, cross-shot coherence, and visual quality.


Get this paper in your agent:

hf papers read 2606.14667

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 1

#### ernie-research/Memento • Text-to-Video • Updated about 2 hours ago • 19 • 3




Similar Articles

PermaVid: Consistent Video Generation Across Edits via Disentangled Context Memory

DecMem: Towards Minute-Long Consistent World Generation with Decoupled Memory

MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

FadeMem: Distance-Aware Memory Consolidation for Autoregressive Video Diffusion

MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation
