Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance
Summary
This paper introduces Sparkle, a new dataset and benchmark for instruction-guided video background replacement, addressing the lack of high-quality training data in this domain. It proposes a scalable pipeline with decoupled guidance to generate realistic foreground-background interactions.
Source: https://huggingface.co/papers/2605.06535
Abstract
A new dataset and benchmark for background replacement in video editing are introduced, addressing limitations in existing datasets through a scalable pipeline with improved guidance mechanisms.
In recent years, open-source efforts like Senorita-2M have propelled video editing toward natural language instruction. However, current publicly available datasets predominantly focus on local editing or style transfer, which largely preserve the original scene structure and are easier to scale. In contrast, Background Replacement, a task central to creative applications such as film production and advertising, requires synthesizing entirely new, temporally consistent scenes while maintaining accurate foreground-background interactions, making large-scale data generation significantly more challenging. Consequently, this complex task remains largely underexplored due to a scarcity of high-quality training data. This gap is evident in poorly performing state-of-the-art models, e.g., Kiwi-Edit, because the primary open-source dataset that contains this task, i.e., OpenVE-3M, frequently produces static, unnatural backgrounds. In this paper, we trace this quality degradation to a lack of precise background guidance during data synthesis. Accordingly, we design a scalable pipeline that generates foreground and background guidance in a decoupled manner with strict quality filtering. Building on this pipeline, we introduce Sparkle, a dataset of ~140K video pairs spanning five common background-change themes, alongside Sparkle-Bench, the largest evaluation benchmark tailored for background replacement to date. Experiments demonstrate that our dataset and the model trained on it achieve substantially better performance than all existing baselines on both OpenVE-Bench and Sparkle-Bench. Our proposed dataset, benchmark, and model are fully open-sourced at https://showlab.github.io/Sparkle/.
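The abstract describes the data pipeline only at a high level. As a minimal, purely illustrative sketch (the `EditPair` schema, the injected helper callables, and the score threshold below are assumptions, not the Sparkle implementation), decoupled foreground/background guidance with strict quality filtering might be organized like this:

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class EditPair:
    source_clip: str       # path to the original video
    edited_clip: str       # path to the background-replaced video
    instruction: str       # natural-language edit instruction


def build_pairs(
    clips: Iterable[tuple[str, str]],
    make_foreground_guidance: Callable[[str], str],
    make_background_guidance: Callable[[str, str], str],
    composite: Callable[[str, str, str], str],
    quality_score: Callable[[str, str], float],
    threshold: float = 0.8,
) -> list[EditPair]:
    """Decoupled guidance: foreground and background guidance come from
    separate components, are composited, then strictly filtered."""
    kept: list[EditPair] = []
    for clip, instruction in clips:
        fg = make_foreground_guidance(clip)               # e.g. a subject matte
        bg = make_background_guidance(clip, instruction)  # new scene, generated independently
        edited = composite(clip, fg, bg)                  # recombine for plausible interactions
        if quality_score(clip, edited) >= threshold:      # drop static or unnatural results
            kept.append(EditPair(clip, edited, instruction))
    return kept
```

The separation into two guidance callables mirrors the "decoupled" framing in the abstract; in practice each stage would be backed by its own model, and the filter would combine several automatic checks rather than a single score.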
Get this paper in your agent:
hf papers read 2605.06535
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper (1)
stdKonjac/Kiwi-Sparkle-720P-81F • 5B • Updated about 2 hours ago
Datasets citing this paper (2)
stdKonjac/Sparkle • Updated about 3 hours ago • 137k • 15
stdKonjac/Sparkle-Bench • Updated about 3 hours ago • 458 • 9
Similar Articles
Boosting Visual Instruction Tuning with Self-Supervised Guidance
This paper proposes augmenting visual instruction tuning in multimodal language models with self-supervised tasks expressed as natural language instructions, improving vision-centric reasoning without additional architecture or annotations. By reformulating classical self-supervised pretext tasks as image-instruction-response triplets, the method achieves consistent performance improvements across multiple benchmarks by injecting only 3-10% visually grounded instructions into the training data.
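The summary above does not give the triplet format. As one illustrative example (the field names and the choice of rotation prediction are assumptions, not the paper's schema), a classical pretext task can be recast as an image-instruction-response triplet roughly like this:

```python
import random

from PIL import Image


def rotation_triplet(image_path: str) -> dict:
    """Recast rotation prediction, a classical self-supervised pretext task,
    as an image-instruction-response triplet. Field names are illustrative."""
    angle = random.choice([0, 90, 180, 270])
    rotated = Image.open(image_path).rotate(angle, expand=True)
    return {
        "image": rotated,                                   # model input image
        "instruction": "By how many degrees has this image been rotated? "
                       "Answer with 0, 90, 180, or 270.",
        "response": f"The image has been rotated by {angle} degrees.",
    }
```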
SPARK: Self-Play with Asymmetric Reward from Knowledge Graphs
This paper introduces SPARK, a self-play reinforcement learning framework that leverages knowledge graphs derived from scientific literature to improve relational reasoning in vision-language models.
ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis
ReImagine introduces an image-first approach to controllable high-quality human video generation, combining SMPL-X motion guidance with video diffusion models to decouple appearance from temporal consistency.
Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
Stream-R1 introduces a reliability-perplexity aware reward distillation framework for streaming video generation that adaptively weights supervision to improve visual and motion quality without additional computational overhead.
Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction
Re2Pix is a hierarchical video prediction framework that improves future video generation by first predicting semantic representations using frozen vision foundation models, then conditioning a latent diffusion model on these predictions to generate photorealistic frames. The approach addresses train-test mismatches through nested dropout and mixed supervision strategies, achieving improved temporal semantic consistency and perceptual quality on autonomous driving benchmarks.
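The summary describes the two-stage scheme only at a high level. The toy PyTorch module below (the module names, the GRU predictor, and the linear pixel decoder are stand-ins, not the Re2Pix architecture) sketches the "predict semantics first, then pixels" idea under those assumptions:

```python
import torch
import torch.nn as nn


class SemanticsFirstPredictor(nn.Module):
    """Toy two-stage video predictor: (1) predict future semantic features
    from a frozen vision encoder with a lightweight sequence model, then
    (2) decode pixels from the predicted features (a stand-in for the
    conditional latent diffusion model described in the summary)."""

    def __init__(self, frozen_encoder: nn.Module, feat_dim: int = 768):
        super().__init__()
        self.encoder = frozen_encoder.eval()           # frozen vision foundation model
        for p in self.encoder.parameters():
            p.requires_grad_(False)
        self.semantic_predictor = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.pixel_decoder = nn.Linear(feat_dim, 3 * 64 * 64)   # stand-in decoder

    def forward(self, context_frames: torch.Tensor) -> torch.Tensor:
        # context_frames: (batch, time, channels, height, width)
        b, t = context_frames.shape[:2]
        with torch.no_grad():                           # encoder stays frozen
            feats = self.encoder(context_frames.flatten(0, 1)).view(b, t, -1)
        pred_feats, _ = self.semantic_predictor(feats)  # stage 1: future semantics
        next_frame = self.pixel_decoder(pred_feats[:, -1])      # stage 2: pixels
        return next_frame.view(b, 3, 64, 64)
```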