DecMem: Towards Minute-Long Consistent World Generation with Decoupled Memory

Hugging Face Daily Papers

Summary

DecMem introduces a decoupled memory architecture with Sparse Global Memory and Anchored Local Memory to achieve consistent minute-long video generation, outperforming state-of-the-art methods.

Recent advances in video generative models have promoted rapid progress in controllable world models. However, maintaining fine-grained spatio-temporal consistency under long-horizon reasoning remains a key challenge. In this work, we move beyond explicit 3D memory and coarse frame-level implicit modeling, and propose a fine-grained, learnable, and scalable memory for consistent world generation. We first identify two fundamental limitations of naïve learnable memory architectures in long-horizon extrapolation, namely computational inefficiency and attention dispersion. Through a systematic analysis of attention dispersion, we propose DecMem, a decoupled memory architecture that employs Sparse Global Memory for efficient fine-grained access to global history and Anchored Local Memory for stable and high-quality extrapolation. Extensive experiments demonstrate that DecMem significantly outperforms current state-of-the-art methods. By ensuring precise and efficient long-term memory and achieving superior extrapolation capabilities, DecMem enables minute-level controllable long video generation with high fidelity and consistency.
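
To make the attention dispersion limitation concrete, here is a minimal sketch of one plausible reading of it, not the paper's own analysis: under plain softmax attention, the weight assigned to a single relevant memory entry shrinks as the number of competing entries grows. The function names, the scores (4.0 for the relevant key, 0.0 for every distractor), and the memory sizes are illustrative assumptions and do not come from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relevant_weight(num_keys, relevant_score=4.0, distractor_score=0.0):
    """Attention weight on one relevant key among num_keys - 1 distractors.

    Assumption for illustration: the relevant key keeps a fixed score
    while the history (and so the number of distractors) keeps growing.
    """
    scores = [relevant_score] + [distractor_score] * (num_keys - 1)
    return softmax(scores)[0]


# The longer the history, the less attention mass the relevant entry keeps.
for n in (16, 256, 4096):
    print(n, round(relevant_weight(n), 4))
```

In this toy setting the relevant weight equals exp(4) / (exp(4) + n - 1), so it decays roughly as 1/n once the history is long. This also hints at the cost problem: dense attention has to score every stored entry for every query.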

Cached at: 06/01/26, 11:20 AM

Source: https://huggingface.co/papers/2605.31336

Abstract

A novel decoupled memory architecture called DecMem is introduced for consistent long-horizon video generation, addressing computational inefficiency and attention dispersion issues in learnable memory systems.

Recent advances in video generative models have promoted rapid progress in controllable world models. However, maintaining fine-grained spatio-temporal consistency under long-horizon reasoning remains a key challenge. In this work, we move beyond explicit 3D memory and coarse frame-level implicit modeling, and propose a fine-grained, learnable, and scalable memory for consistent world generation. We first identify two fundamental limitations of naïve learnable memory architectures in long-horizon extrapolation, namely computational inefficiency and attention dispersion. Through a systematic analysis of attention dispersion, we propose DecMem, a decoupled memory architecture that employs Sparse Global Memory for efficient fine-grained access to global history and Anchored Local Memory for stable and high-quality extrapolation. Extensive experiments demonstrate that DecMem significantly outperforms current state-of-the-art methods. By ensuring precise and efficient long-term memory and achieving superior extrapolation capabilities, DecMem enables minute-level controllable long video generation with high fidelity and consistency.
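
The abstract does not spell out how the two memories are built, so the following is only a rough sketch of the general idea of decoupling, under stated assumptions: a sparse global path that keeps the top-k most similar older entries, plus a local path that always keeps a bounded window of the most recent entries. The names `select_context` and `attend`, the dot-product similarity, and the parameters `top_k` and `local_window` are hypothetical. What "anchored" means in Anchored Local Memory is not modeled here.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_context(query, history, top_k=2, local_window=2):
    """Pick the history indices a query may attend to (hypothetical helper).

    Local path: the last local_window entries are always kept.
    Global path: among older entries, keep only the top_k by dot product.
    """
    n = len(history)
    split = max(0, n - local_window)
    local_idx = list(range(split, n))
    ranked = sorted(range(split), key=lambda i: dot(query, history[i]), reverse=True)
    global_idx = sorted(ranked[:top_k])
    return global_idx + local_idx


def attend(query, history, indices):
    """Softmax attention restricted to the selected indices."""
    weights = softmax([dot(query, history[i]) for i in indices])
    return [
        sum(w * history[i][d] for w, i in zip(weights, indices))
        for d in range(len(query))
    ]


# Toy 2-D features standing in for stored memory entries (made-up data).
history = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.0, 1.0], [0.0, 1.0], [0.1, 0.9]]
query = [1.0, 0.0]
idx = select_context(query, history)
out = attend(query, history, idx)
print(idx)  # → [0, 2, 4, 5]
```

The point of the split in this sketch is that the softmax runs over at most `top_k + local_window` entries no matter how long the history becomes, which bounds the attention dilution for each query. Note that the selection step here still scores every older entry, so an efficient design would need a cheaper way to choose the global entries.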

Get this paper in your agent:

hf papers read 2605.31336

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 1

#### KlingTeam/DecMem • Video-to-Video • Updated about 4 hours ago • 2

Similar Articles

SimpleMem: Efficient Lifelong Memory for LLM Agents

Papers with Code Trending

Introduces SimpleMem, an efficient memory framework for LLM agents that uses semantic lossless compression to improve accuracy and reduce token consumption, achieving 26.4% F1 improvement and up to 30x reduction in inference-time token usage.

Long Video Generation (4 minute read)

TLDR AI

The article introduces A²RD, a novel architecture for generating consistent long videos using agentic autoregressive diffusion. It proposes a Retrieve–Synthesize–Refine–Update cycle and a new benchmark, LVBench-C, to address semantic drift in long-horizon video synthesis.