Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation
Summary
Echo-Forcing introduces a scene memory framework for interactive long video generation, using hierarchical temporal memory, scene recall frames, and difference-aware memory decay to handle prompt switching and long-term recall. The method is training-free and achieves strong performance on VBench-Long.
Cached at: 05/20/26, 02:35 AM
Source: https://huggingface.co/papers/2605.16003
Abstract
Echo-Forcing addresses limitations in interactive long-video generation by decoupling historical memory and recent dynamics through hierarchical temporal memory, scene recall frames, and difference-aware memory decay mechanisms.
Autoregressive video diffusion models enable open-ended generation through local attention and KV caching. However, existing training-free long-video optimization methods mainly focus on stable extension under a single prompt, which makes it difficult for them to handle interactive scenarios involving prompt switching, old scene forgetting, and historical scene recall. We identify the core bottleneck as the functional entanglement of historical KV states: stable anchors and recent dynamics are handled by the same cache policy, leading to outdated background contamination, delayed response to new prompts, and loss of long-range memory. To address this issue, we propose Echo-Forcing, a training-free scene memory framework specifically designed for interactive long video generation with three core mechanisms: (1) Hierarchical Temporal Memory, which decouples stable anchors, compressed history, and recent windows under relative RoPE; (2) Scene Recall Frames, which compresses historical scenes into spatially structured KV representations to support long-term recall; and (3) Difference-aware Memory Decay, which adaptively forgets conflicting tokens according to the discrepancy between old and new scenes. Based on these designs, Echo-Forcing uniformly supports smooth transitions, hard cuts, and long-range scene recall under a bounded cache budget. Extensive evaluations on VBench-Long further demonstrate that Echo-Forcing achieves the best overall performance in both long-video generation and interactive video generation settings. Our code is released at https://github.com/mingqiangWu/Echo-Forcing
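The abstract does not spell out the cache policy, so the following is only a toy sketch of the general idea of a tiered, bounded memory with discrepancy-based forgetting. Everything in it is an assumption made for illustration: the names `TieredMemory` and `decay_weights`, the pairwise-averaging compression, and the exponential weighting on cosine discrepancy are hypothetical and are not taken from the paper or its released code. It works on plain Python vectors rather than real KV tensors and leaves out relative RoPE entirely.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class TieredMemory:
    """Hypothetical three-tier cache: fixed anchors, pooled history, recent window."""

    def __init__(self, n_anchor=1, n_history=2, n_recent=3):
        self.n_anchor = n_anchor
        self.n_history = n_history
        self.n_recent = n_recent
        self.anchors = []        # earliest frames, never evicted
        self.history = deque()   # averaged summaries of evicted frames
        self.recent = deque()    # latest frames, stored unmodified

    def append(self, frame):
        # The first n_anchor frames become permanent anchors.
        if len(self.anchors) < self.n_anchor:
            self.anchors.append(list(frame))
            return
        self.recent.append(list(frame))
        if len(self.recent) > self.n_recent:
            # The oldest recent frame moves into the history tier.
            self.history.append(self.recent.popleft())
            if len(self.history) > self.n_history:
                # Assumed compression: average the two oldest history entries.
                a = self.history.popleft()
                b = self.history.popleft()
                self.history.appendleft([(x + y) / 2 for x, y in zip(a, b)])

    def size(self):
        """Total number of stored entries; never exceeds the three budgets combined."""
        return len(self.anchors) + len(self.history) + len(self.recent)


def decay_weights(entries, new_scene, rate=2.0):
    """Assumed decay rule: weight = exp(-rate * d), with d = (1 - cosine) / 2 in [0, 1].

    Entries aligned with the new scene keep weight 1.0; conflicting entries
    are down-weighted toward exp(-rate).
    """
    return [math.exp(-rate * (1.0 - cosine(e, new_scene)) / 2.0) for e in entries]
```

In this sketch the total entry count is capped at `n_anchor + n_history + n_recent` however many frames are appended, which mirrors the bounded cache budget the abstract mentions. The weights from `decay_weights` could be used to scale old entries after a prompt switch, with a large `rate` approximating a hard cut and a small one a smooth transition. That mapping is also an assumption rather than something the abstract states.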
Similar Articles
Echo-Infinity: Learning Evolving Memory for Real-Time Infinite Video Generation
Echo-Infinity introduces a learnable evolving memory mechanism for autoregressive video generation, enabling real-time infinite video generation with constant memory cost and state-of-the-art performance.
Long Video Generation (4 minute read)
The article introduces A²RD, a novel architecture for generating consistent long videos using agentic autoregressive diffusion. It proposes a Retrieve–Synthesize–Refine–Update cycle and a new benchmark, LVBench-C, to address semantic drift in long-horizon video synthesis.
LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation
LongLive-RAG formulates long video generation as a retrieval-augmented generation problem, using a dynamic memory of previously generated latents to reduce error accumulation and identity drift, achieving improved quality across multiple autoregressive backbones.
EverMemOS: A Self-Organizing Memory Operating System for Structured Long-Horizon Reasoning
EverMemOS is a self-organizing memory operating system for large language models that enhances long-horizon reasoning by structuring dialogue into memory cells and scenes.
S3Mem: Structured Spatiotemporal Scene-Event Memory for Long-Horizon Interactive Question Answering
S3Mem proposes a structured spatiotemporal scene-event memory framework for long-horizon interactive question answering, using anchor-sensitive retrieval and token-budget-aware evidence interface to outperform standard RAG in multiple environments.