

Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance

May 7, 2026
Authors: Ziyun Zeng, Yiqi Lin, Guoqiang Liang, Mike Zheng Shou
cs.AI

Abstract

In recent years, open-source efforts like Senorita-2M have propelled video editing toward natural language instruction. However, current publicly available datasets predominantly focus on local editing or style transfer, which largely preserve the original scene structure and are easier to scale. In contrast, Background Replacement, a task central to creative applications such as film production and advertising, requires synthesizing entirely new, temporally consistent scenes while maintaining accurate foreground-background interactions, making large-scale data generation significantly more challenging. Consequently, this complex task remains largely underexplored due to a scarcity of high-quality training data. This gap is evident in the poor performance of state-of-the-art models such as Kiwi-Edit, since the primary open-source dataset covering this task, OpenVE-3M, frequently produces static, unnatural backgrounds. In this paper, we trace this quality degradation to a lack of precise background guidance during data synthesis. Accordingly, we design a scalable pipeline that generates foreground and background guidance in a decoupled manner with strict quality filtering. Building on this pipeline, we introduce Sparkle, a dataset of ~140K video pairs spanning five common background-change themes, alongside Sparkle-Bench, the largest evaluation benchmark tailored for background replacement to date. Experiments demonstrate that our dataset and the model trained on it achieve substantially better performance than all existing baselines on both OpenVE-Bench and Sparkle-Bench. Our proposed dataset, benchmark, and model are fully open-sourced at https://showlab.github.io/Sparkle/.
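The abstract describes a pipeline that produces foreground and background guidance independently and keeps only pairs passing strict quality filters. The following is a minimal, purely illustrative sketch of that decoupled flow; all function names, score fields, and thresholds here are assumptions for illustration, not the authors' actual implementation.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch of a decoupled-guidance data pipeline: foreground
# and background guidance are generated independently, composed into a
# candidate training pair, and retained only if the pair passes strict
# quality filtering. Thresholds and score names are illustrative.

@dataclass
class Sample:
    fg_guidance: str             # e.g., a foreground mask/track descriptor
    bg_guidance: str             # e.g., a synthesized background descriptor
    temporal_consistency: float  # higher is better (0..1)
    interaction_score: float     # plausibility of fg-bg interaction (0..1)

def extract_foreground_guidance(video_id: str) -> str:
    # Placeholder: in practice a segmentation/matting model would run here.
    return f"fg({video_id})"

def generate_background_guidance(instruction: str) -> str:
    # Placeholder: in practice a text-conditioned video generator.
    return f"bg({instruction})"

def passes_quality_filter(s: Sample, tc_min: float = 0.8,
                          ia_min: float = 0.7) -> bool:
    # Strict filtering: drop pairs with static/inconsistent backgrounds
    # or implausible foreground-background interactions.
    return s.temporal_consistency >= tc_min and s.interaction_score >= ia_min

def build_pair(video_id: str, instruction: str,
               tc: float, ia: float) -> Optional[Sample]:
    # Decoupled generation, then composition and filtering.
    sample = Sample(
        fg_guidance=extract_foreground_guidance(video_id),
        bg_guidance=generate_background_guidance(instruction),
        temporal_consistency=tc,
        interaction_score=ia,
    )
    return sample if passes_quality_filter(sample) else None

# One pair that survives filtering and one that is discarded:
kept = build_pair("clip_001", "replace background with a rainy street", 0.9, 0.85)
dropped = build_pair("clip_002", "replace background with a beach", 0.5, 0.9)
```

The point of the decoupling is that each guidance stream can be generated and validated with its own specialized model before composition, so a weak background never silently degrades an otherwise good pair.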