VOID: Video Object and Interaction Deletion

Abstract

Existing video object removal methods excel at inpainting content "behind" the object and correcting appearance-level artifacts such as shadows and reflections. However, when the removed object has more significant interactions, such as collisions with other objects, current models fail to correct them and produce implausible results. We present VOID, a video object removal framework designed to perform physically plausible inpainting in these complex scenarios. To train the model, we generate a new paired dataset of counterfactual object removals using Kubric and HUMOTO, where removing an object requires altering downstream physical interactions. During inference, a vision-language model identifies regions of the scene affected by the removed object. These regions are then used to guide a video diffusion model that generates physically consistent counterfactual outcomes. Experiments on both synthetic and real data show that our approach preserves consistent scene dynamics after object removal better than prior video object removal methods. We hope this framework sheds light on how to make video editing models better simulators of the world through high-level causal reasoning.
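
To illustrate how the regions identified at inference time might be handed to a video diffusion model, the following minimal Python sketch merges a per-frame mask of the removed object with a per-frame mask of the affected regions into a single label map. The label values, the function names, and the rule that the object mask takes precedence where the two overlap are assumptions made for illustration only; they are not taken from the VOID implementation.

```python
from typing import List

Mask = List[List[bool]]        # one frame: rows of per-pixel flags
LabelMap = List[List[int]]     # one frame: rows of per-pixel labels

# Hypothetical label values, chosen for this sketch only.
KEEP, REMOVE, AFFECTED = 0, 1, 2


def build_guidance_map(object_mask: Mask, affected_mask: Mask) -> LabelMap:
    """Merge two binary masks for one frame into a single label map.

    Pixels of the removed object get REMOVE, pixels that are only in the
    affected region get AFFECTED, and everything else gets KEEP.
    """
    if len(object_mask) != len(affected_mask):
        raise ValueError("masks must have the same height")
    labels: LabelMap = []
    for obj_row, aff_row in zip(object_mask, affected_mask):
        if len(obj_row) != len(aff_row):
            raise ValueError("masks must have the same width")
        row = []
        for is_obj, is_aff in zip(obj_row, aff_row):
            if is_obj:
                row.append(REMOVE)      # assumed precedence: object wins
            elif is_aff:
                row.append(AFFECTED)
            else:
                row.append(KEEP)
        labels.append(row)
    return labels


def build_guidance_video(object_masks: List[Mask],
                         affected_masks: List[Mask]) -> List[LabelMap]:
    """Apply build_guidance_map to every frame of a clip."""
    if len(object_masks) != len(affected_masks):
        raise ValueError("mask sequences must have the same number of frames")
    return [build_guidance_map(o, a)
            for o, a in zip(object_masks, affected_masks)]
```

In this sketch a downstream model could treat REMOVE pixels as content to inpaint and AFFECTED pixels as content whose motion must be regenerated, while KEEP pixels are left unchanged; how VOID actually encodes this guidance is not specified in the abstract.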
