MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation
Summary
MuSS introduces a large-scale dataset and benchmark for multi-shot subject-to-video generation, addressing narrative logic and copy-paste issues in cinematic storytelling.
Cached at: 05/12/26, 07:30 AM
Source: https://huggingface.co/papers/2604.23789
Abstract
MuSS is a large-scale dual-track dataset designed for multi-shot video generation that addresses narrative logic, spatiotemporal alignment, and copy-paste issues in subject-to-video generation through a progressive captioning pipeline and cross-shot matching mechanism.
While video foundation models excel at single-shot generation, real-world cinematic storytelling inherently relies on complex multi-shot sequencing. Further progress is constrained by the absence of datasets that address three core challenges: authentic narrative logic, spatiotemporal text-video alignment conflicts, and the “copy-paste” dilemma prevalent in Subject-to-Video (S2V) generation. To bridge this gap, we introduce MuSS, a large-scale, dual-track dataset tailored for multi-shot video and S2V generation. Sourced from over 3,000 movies, MuSS explicitly supports both complex montage transitions and subject-centric narratives. To construct this dataset, we pioneer a progressive captioning pipeline that eliminates contextual conflicts by ensuring local shot-level accuracy before enforcing global narrative coherence. Crucially, we implement a cross-shot matching mechanism to fundamentally eradicate the S2V copy-paste shortcut. Alongside the dataset, we propose the Cinematic Narrative Benchmark, featuring a visual-logic-driven paradigm and a novel Anti-Copy-Paste Variance (ACP-Var) metric to rigorously assess continuous storytelling and 3D structural consistency. Extensive experiments demonstrate that while current baselines struggle with continuous narrative logic or degenerate into trivial 2D sticker generators, our MuSS-augmented model achieves state-of-the-art narrative effectiveness and cross-shot identity preservation.
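The abstract does not spell out how cross-shot matching works. One plausible reading is that the subject reference for a target shot is always drawn from a different shot of the same subject, so a model cannot simply paste the reference pixels into its output. The sketch below illustrates only that idea. The function name cross_shot_pairs and the (shot_id, subject_id) input format are hypothetical and are not taken from the paper or its code.

```python
from collections import defaultdict


def cross_shot_pairs(detections):
    """Build (subject, reference_shot, target_shot) triples.

    `detections` is a list of (shot_id, subject_id) tuples, one per subject
    appearance. This input format is an assumption made for illustration.
    The reference shot always differs from the target shot.
    """
    # Collect the distinct shots in which each subject appears, in order.
    shots_by_subject = defaultdict(list)
    for shot_id, subject_id in detections:
        if shot_id not in shots_by_subject[subject_id]:
            shots_by_subject[subject_id].append(shot_id)

    pairs = []
    for subject_id, shots in shots_by_subject.items():
        for target in shots:
            for reference in shots:
                # Skip same-shot pairs: these are the ones that would allow
                # a copy-paste shortcut.
                if reference != target:
                    pairs.append((subject_id, reference, target))
    return pairs


# Toy input: "alice" appears in three shots, "bob" in only one.
detections = [(0, "alice"), (1, "alice"), (1, "bob"), (2, "alice")]
pairs = cross_shot_pairs(detections)
```

Under this reading, a subject that appears in only one shot ("bob" above) yields no training pairs, because no cross-shot reference exists for it.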
Similar Articles
Memento: Reconstruct to Remember for Consistent Long Video Generation
Memento is a subject-reconstruction-guided framework that improves long-form video generation by preserving recurring subjects through memory-based reconstruction and dual-query mechanisms, achieving state-of-the-art performance in long-term subject consistency and cross-shot coherence.
CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives
CausalCine is a new academic framework for real-time, interactive multi-shot video generation that uses causal modeling and dynamic memory routing to improve cross-shot coherence in autoregressive models.
MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation
MSAVBench is the first comprehensive benchmark and adaptive evaluation framework for multi-shot audio-video generation, assessing 19 models across diverse tasks and achieving high alignment with human judgment.
Artifact-Bench: Evaluating MLLMs on Detecting and Assessing the Artifacts of AI-Generated Videos
Artifact-Bench is a comprehensive benchmark that evaluates multimodal large language models on detecting and analyzing artifacts in AI-generated videos, revealing significant limitations and misalignment with human perception.
VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis
VGenST-Bench is a benchmark that uses generative models to actively synthesize controlled spatio-temporal reasoning scenarios, with a multi-agent pipeline and human quality control, to evaluate multimodal large language models.