VOID: Video Object and Interaction Deletion
April 2, 2026
Authors: Saman Motamed, William Harvey, Benjamin Klein, Luc Van Gool, Zhuoning Yuan, Ta-Ying Cheng
cs.AI
Abstract
Existing video object removal methods excel at inpainting content "behind" the object and correcting appearance-level artifacts such as shadows and reflections. However, when the removed object has more significant interactions, such as collisions with other objects, current models fail to correct them and produce implausible results. We present VOID, a video object removal framework designed to perform physically plausible inpainting in these complex scenarios. To train the model, we generate a new paired dataset of counterfactual object removals using Kubric and HUMOTO, where removing an object requires altering downstream physical interactions. During inference, a vision-language model identifies regions of the scene affected by the removed object. These regions are then used to guide a video diffusion model that generates physically consistent counterfactual outcomes. Experiments on both synthetic and real data show that our approach better preserves consistent scene dynamics after object removal compared to prior video object removal methods. We hope this framework sheds light on how to make video editing models better simulators of the world through high-level causal reasoning.
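The two-stage inference pipeline the abstract describes (a VLM flags regions whose dynamics depend on the removed object, and those regions then guide a video diffusion inpainter) can be sketched as below. This is a minimal illustrative mock of our reading of the abstract, not the paper's actual API: all function names, mask shapes, and the toy "collision partner" region are hypothetical placeholders.

```python
import numpy as np

# Hypothetical sketch of VOID-style inference (illustrative only; the real
# system uses a vision-language model and a video diffusion model).

def removal_mask(video, obj_id):
    # Placeholder: per-frame binary mask of the object being removed.
    t, h, w, _ = video.shape
    m = np.zeros((t, h, w), dtype=bool)
    m[:, 2:4, 2:4] = True  # toy object footprint
    return m

def vlm_affected_regions(video, mask):
    # Placeholder for the VLM step: regions whose dynamics depend on the
    # removed object (e.g. another object it would have collided with).
    affected = np.zeros_like(mask)
    affected[:, 4:6, 2:4] = True  # toy collision-partner region
    return affected

def guided_inpaint(video, guidance_mask):
    # Placeholder for the guided video diffusion model: here we simply zero
    # the guided region to mark where generation would occur.
    out = video.copy()
    out[guidance_mask] = 0.0
    return out

video = np.random.rand(8, 16, 16, 3)          # (frames, H, W, RGB)
obj = removal_mask(video, obj_id=0)
affected = vlm_affected_regions(video, obj)
guidance = obj | affected                      # inpaint object AND its effects
result = guided_inpaint(video, guidance)
```

The key design point mirrored here is that the inpainting mask is the union of the object's own pixels and the VLM-identified affected regions, so downstream interactions (not just the occluded background) are regenerated.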