VOID: Video Object and Interaction Deletion
April 2, 2026
Authors: Saman Motamed, William Harvey, Benjamin Klein, Luc Van Gool, Zhuoning Yuan, Ta-Ying Cheng
cs.AI
Abstract
Existing video object removal methods excel at inpainting content "behind" the object and correcting appearance-level artifacts such as shadows and reflections. However, when the removed object has more significant interactions, such as collisions with other objects, current models fail to correct those interactions and produce implausible results. We present VOID, a video object removal framework designed to perform physically plausible inpainting in these complex scenarios. To train the model, we generate a new paired dataset of counterfactual object removals using Kubric and HUMOTO, in which removing an object requires altering downstream physical interactions. During inference, a vision-language model identifies regions of the scene affected by the removed object. These regions then guide a video diffusion model that generates physically consistent counterfactual outcomes. Experiments on both synthetic and real data show that our approach better preserves consistent scene dynamics after object removal than prior video object removal methods. We hope this framework sheds light on how to make video editing models better simulators of the world through high-level causal reasoning.
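The abstract describes a two-stage inference pipeline: a vision-language model first identifies which scene regions are causally affected by the removed object, and those regions then guide a video diffusion model. Below is a minimal control-flow sketch of that idea; every function and class name here is a hypothetical placeholder (the paper's actual interfaces are not specified in the abstract), with the VLM and diffusion stages replaced by stubs.

```python
# Hypothetical sketch of a VOID-style two-stage inference pipeline.
# All names (identify_affected_regions, inpaint_video, Region, ...)
# are illustrative placeholders, not the authors' actual API.
from dataclasses import dataclass


@dataclass
class Region:
    frame: int    # frame index the region belongs to
    bbox: tuple   # (x0, y0, x1, y1) in pixel coordinates


def identify_affected_regions(video, object_mask):
    """Stage 1 (stub): a vision-language model reasons about which
    regions are causally affected by the removed object, e.g. another
    object whose trajectory the removed object would have deflected."""
    # Stub: flag one fixed region per frame; a real VLM would localize
    # genuinely affected areas from the video and mask.
    return [Region(frame=t, bbox=(10, 10, 50, 50)) for t in range(len(video))]


def inpaint_video(video, object_mask, affected_regions):
    """Stage 2 (stub): a video diffusion model regenerates both the
    object's pixels and the affected regions so that downstream
    physical interactions remain consistent."""
    # Stub: pass frames through unchanged; a real model would
    # resynthesize the masked and affected areas.
    return list(video)


def remove_object(video, object_mask):
    regions = identify_affected_regions(video, object_mask)
    return inpaint_video(video, object_mask, regions)


frames = [f"frame_{t}" for t in range(8)]  # toy stand-in for video frames
edited = remove_object(frames, object_mask=None)
print(len(edited))  # frame count is preserved end to end
```

The split mirrors the abstract's division of labor: high-level causal reasoning (which regions must change) is delegated to the VLM, while pixel synthesis is left to the diffusion model.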