Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance
Summary
This paper introduces Sparkle, a new dataset and benchmark for instruction-guided video background replacement, addressing the lack of high-quality training data in this domain. It proposes a scalable pipeline with decoupled guidance to generate realistic foreground-background interactions.
Source: https://huggingface.co/papers/2605.06535
Abstract
A new dataset and benchmark for background replacement in video editing are introduced, addressing limitations in existing datasets through a scalable pipeline with improved guidance mechanisms.
In recent years, open-source efforts like Senorita-2M have propelled video editing toward natural language instruction. However, current publicly available datasets predominantly focus on local editing or style transfer, which largely preserve the original scene structure and are easier to scale. In contrast, Background Replacement, a task central to creative applications such as film production and advertising, requires synthesizing entirely new, temporally consistent scenes while maintaining accurate foreground-background interactions, making large-scale data generation significantly more challenging. Consequently, this complex task remains largely underexplored due to a scarcity of high-quality training data. This gap is evident in poorly performing state-of-the-art models, e.g., Kiwi-Edit, because the primary open-source dataset that contains this task, i.e., OpenVE-3M, frequently produces static, unnatural backgrounds. In this paper, we trace this quality degradation to a lack of precise background guidance during data synthesis. Accordingly, we design a scalable pipeline that generates foreground and background guidance in a decoupled manner with strict quality filtering. Building on this pipeline, we introduce Sparkle, a dataset of ~140K video pairs spanning five common background-change themes, alongside Sparkle-Bench, the largest evaluation benchmark tailored for background replacement to date. Experiments demonstrate that our dataset and the model trained on it achieve substantially better performance than all existing baselines on both OpenVE-Bench and Sparkle-Bench. Our proposed dataset, benchmark, and model are fully open-sourced at https://showlab.github.io/Sparkle/.
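The abstract describes the data pipeline only at a high level. As a minimal, purely illustrative sketch (the `EditPair` schema, the injected helper callables, and the score threshold below are assumptions, not the Sparkle implementation), decoupled foreground/background guidance with strict quality filtering might be organized like this:

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class EditPair:
    source_clip: str       # path to the original video
    edited_clip: str       # path to the background-replaced video
    instruction: str       # natural-language edit instruction


def build_pairs(
    clips: Iterable[tuple[str, str]],
    make_foreground_guidance: Callable[[str], str],
    make_background_guidance: Callable[[str, str], str],
    composite: Callable[[str, str, str], str],
    quality_score: Callable[[str, str], float],
    threshold: float = 0.8,
) -> list[EditPair]:
    """Decoupled guidance: foreground and background guidance come from
    separate components, are composited, then strictly filtered."""
    kept: list[EditPair] = []
    for clip, instruction in clips:
        fg = make_foreground_guidance(clip)               # e.g. a subject matte
        bg = make_background_guidance(clip, instruction)  # new scene, generated independently
        edited = composite(clip, fg, bg)                  # recombine for plausible interactions
        if quality_score(clip, edited) >= threshold:      # drop static or unnatural results
            kept.append(EditPair(clip, edited, instruction))
    return kept
```

The separation into two guidance callables mirrors the "decoupled" framing in the abstract; in practice each stage would be backed by its own model, and the filter would combine several automatic checks rather than a single score.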
Get this paper in your agent:
hf papers read 2605.06535
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper (1)
stdKonjac/Kiwi-Sparkle-720P-81F • 5B • Updated about 2 hours ago
Datasets citing this paper (2)
stdKonjac/Sparkle • Updated about 3 hours ago • 137k • 15
stdKonjac/Sparkle-Bench • Updated about 3 hours ago • 458 • 9
Similar Articles
Boosting Visual Instruction Tuning with Self-Supervised Guidance
This paper proposes augmenting visual instruction tuning in multimodal language models with self-supervised tasks expressed as natural language instructions, improving vision-centric reasoning without additional architecture or annotations. By reformulating classical self-supervised pretext tasks as image-instruction-response triplets, the method achieves consistent performance improvements across multiple benchmarks by injecting only 3-10% visually grounded instructions into the training data.
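The summary above does not give the triplet format. As one illustrative example (the field names and the choice of rotation prediction are assumptions, not the paper's schema), a classical pretext task can be recast as an image-instruction-response triplet roughly like this:

```python
import random

from PIL import Image


def rotation_triplet(image_path: str) -> dict:
    """Recast rotation prediction, a classical self-supervised pretext task,
    as an image-instruction-response triplet. Field names are illustrative."""
    angle = random.choice([0, 90, 180, 270])
    rotated = Image.open(image_path).rotate(angle, expand=True)
    return {
        "image": rotated,                                   # model input image
        "instruction": "By how many degrees has this image been rotated? "
                       "Answer with 0, 90, 180, or 270.",
        "response": f"The image has been rotated by {angle} degrees.",
    }
```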
SPARK: Self-Play with Asymmetric Reward from Knowledge Graphs
This paper introduces SPARK, a self-play reinforcement learning framework that leverages knowledge graphs derived from scientific literature to improve relational reasoning in vision-language models.
ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis
ReImagine introduces an image-first approach to controllable high-quality human video generation, combining SMPL-X motion guidance with video diffusion models to decouple appearance from temporal consistency.
Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
Stream-R1 introduces a reliability-perplexity aware reward distillation framework for streaming video generation that adaptively weights supervision to improve visual and motion quality without additional computational overhead.
Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction
Re2Pix is a hierarchical video prediction framework that improves future video generation by first predicting semantic representations using frozen vision foundation models, then conditioning a latent diffusion model on these predictions to generate photorealistic frames. The approach addresses train-test mismatches through nested dropout and mixed supervision strategies, achieving improved temporal semantic consistency and perceptual quality on autonomous driving benchmarks.
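The summary describes the two-stage scheme only at a high level. The toy PyTorch module below (the module names, the GRU predictor, and the linear pixel decoder are stand-ins, not the Re2Pix architecture) sketches the "predict semantics first, then pixels" idea under those assumptions:

```python
import torch
import torch.nn as nn


class SemanticsFirstPredictor(nn.Module):
    """Toy two-stage video predictor: (1) predict future semantic features
    from a frozen vision encoder with a lightweight sequence model, then
    (2) decode pixels from the predicted features (a stand-in for the
    conditional latent diffusion model described in the summary)."""

    def __init__(self, frozen_encoder: nn.Module, feat_dim: int = 768):
        super().__init__()
        self.encoder = frozen_encoder.eval()           # frozen vision foundation model
        for p in self.encoder.parameters():
            p.requires_grad_(False)
        self.semantic_predictor = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.pixel_decoder = nn.Linear(feat_dim, 3 * 64 * 64)   # stand-in decoder

    def forward(self, context_frames: torch.Tensor) -> torch.Tensor:
        # context_frames: (batch, time, channels, height, width)
        b, t = context_frames.shape[:2]
        with torch.no_grad():                           # encoder stays frozen
            feats = self.encoder(context_frames.flatten(0, 1)).view(b, t, -1)
        pred_feats, _ = self.semantic_predictor(feats)  # stage 1: future semantics
        next_frame = self.pixel_decoder(pred_feats[:, -1])      # stage 2: pixels
        return next_frame.view(b, 3, 64, 64)
```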