Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance

Abstract

In recent years, open-source efforts such as Senorita-2M have moved video editing toward control by natural-language instructions. However, most publicly available datasets focus on local editing or style transfer, which largely preserve the original scene structure and are therefore comparatively easy to scale. In contrast, background replacement, a task central to creative applications such as film production and advertising, requires synthesizing an entirely new, temporally consistent scene while maintaining accurate foreground-background interactions, which makes large-scale data generation significantly more challenging. Consequently, this complex task remains largely underexplored due to a scarcity of high-quality training data. The gap is evident in the poor performance of state-of-the-art models such as Kiwi-Edit, because OpenVE-3M, the primary open-source dataset that covers this task, frequently features static, unnatural backgrounds. In this paper, we trace this quality degradation to a lack of precise background guidance during data synthesis. Accordingly, we design a scalable pipeline that generates foreground and background guidance in a decoupled manner and applies strict quality filtering. Building on this pipeline, we introduce Sparkle, a dataset of approximately 140K video pairs spanning five common background-change themes, alongside Sparkle-Bench, the largest evaluation benchmark tailored to background replacement to date. Experiments demonstrate that our dataset and the model trained on it substantially outperform all existing baselines on both OpenVE-Bench and Sparkle-Bench. Our dataset, benchmark, and model are fully open-sourced at https://showlab.github.io/Sparkle/.
