Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance

Hugging Face Daily Papers

Summary

This paper introduces Sparkle, a new dataset and benchmark for instruction-guided video background replacement, addressing the lack of high-quality training data in this domain. It proposes a scalable pipeline with decoupled guidance to generate realistic foreground-background interactions.

In recent years, open-source efforts like Senorita-2M have propelled video editing toward natural language instruction. However, current publicly available datasets predominantly focus on local editing or style transfer, which largely preserve the original scene structure and are easier to scale. In contrast, Background Replacement, a task central to creative applications such as film production and advertising, requires synthesizing entirely new, temporally consistent scenes while maintaining accurate foreground-background interactions, making large-scale data generation significantly more challenging. Consequently, this complex task remains largely underexplored due to a scarcity of high-quality training data. This gap is evident in the poor performance of state-of-the-art models such as Kiwi-Edit, because the primary open-source dataset covering this task, OpenVE-3M, frequently produces static, unnatural backgrounds. In this paper, we trace this quality degradation to a lack of precise background guidance during data synthesis. Accordingly, we design a scalable pipeline that generates foreground and background guidance in a decoupled manner with strict quality filtering. Building on this pipeline, we introduce Sparkle, a dataset of ~140K video pairs spanning five common background-change themes, alongside Sparkle-Bench, the largest evaluation benchmark tailored for background replacement to date. Experiments demonstrate that our dataset and the model trained on it achieve substantially better performance than all existing baselines on both OpenVE-Bench and Sparkle-Bench. Our proposed dataset, benchmark, and model are fully open-sourced at https://showlab.github.io/Sparkle/.
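The abstract only names the pipeline's high-level steps: generate foreground guidance and background guidance independently, then apply strict quality filtering before pairing. A minimal, purely illustrative Python sketch of such a decoupled data-generation loop might look like the following. Every function and name here (`extract_foreground_guidance`, `synthesize_background_guidance`, `quality_score`) is a hypothetical stand-in, not an API from the paper; the actual pipeline's models and filters are not specified in this summary.

```python
# Illustrative sketch only: each function below is a placeholder for a
# real component (matting model, video generator, filter) that the paper
# does not detail in this summary.

def extract_foreground_guidance(clip):
    # Stand-in for per-frame foreground extraction (e.g., video matting).
    return {"id": clip["id"], "fg_masks": f"masks_for_{clip['id']}"}

def synthesize_background_guidance(clip, instruction):
    # Stand-in for synthesizing a new, temporally consistent background
    # from the edit instruction, independently of the foreground stream.
    return {"id": clip["id"], "bg_video": f"bg_for_{clip['id']}"}

def quality_score(fg, bg):
    # Stand-in for strict quality filtering (temporal consistency,
    # plausibility of foreground-background interaction, etc.).
    return 0.9  # dummy score so the toy example passes the filter

def build_pairs(clips, instruction, threshold=0.8):
    """Decoupled generation: foreground and background guidance are
    produced separately, then kept only if they pass the filter."""
    pairs = []
    for clip in clips:
        fg = extract_foreground_guidance(clip)                  # step 1
        bg = synthesize_background_guidance(clip, instruction)  # step 2
        if quality_score(fg, bg) >= threshold:                  # filter
            pairs.append({"source": clip, "fg": fg, "bg": bg,
                          "instruction": instruction})
    return pairs

clips = [{"id": "clip0"}, {"id": "clip1"}]
dataset = build_pairs(clips, "replace the background with a beach")
print(len(dataset))
```

The point of the decoupling is that the background stream is conditioned on the instruction rather than inferred jointly with the foreground, which is what the paper identifies as the cause of static, unnatural backgrounds in prior data.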

Cached at: 05/08/26, 10:54 AM


Source: https://huggingface.co/papers/2605.06535



Get this paper in your agent:

hf papers read 2605.06535

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 1

stdKonjac/Kiwi-Sparkle-720P-81F • 5B • Updated about 2 hours ago

Datasets citing this paper: 2

stdKonjac/Sparkle • Updated about 3 hours ago • 137k • 15

stdKonjac/Sparkle-Bench • Updated about 3 hours ago • 458 • 9

Spaces citing this paper: 0


Collections including this paper: 1

Similar Articles

Boosting Visual Instruction Tuning with Self-Supervised Guidance

Hugging Face Daily Papers

This paper proposes augmenting visual instruction tuning in multimodal language models with self-supervised tasks expressed as natural language instructions, improving vision-centric reasoning without additional architecture or annotations. By reformulating classical self-supervised pretext tasks as image-instruction-response triplets, the method achieves consistent performance improvements across multiple benchmarks by injecting only 3-10% visually grounded instructions into the training data.

Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction

Hugging Face Daily Papers

Re2Pix is a hierarchical video prediction framework that improves future video generation by first predicting semantic representations using frozen vision foundation models, then conditioning a latent diffusion model on these predictions to generate photorealistic frames. The approach addresses train-test mismatches through nested dropout and mixed supervision strategies, achieving improved temporal semantic consistency and perceptual quality on autonomous driving benchmarks.