VOID: Video Object and Interaction Deletion

Abstract

Existing video object removal methods excel at inpainting the content "behind" the object and at correcting appearance-level artifacts such as shadows and reflections. However, when the removed object has more significant interactions with the scene, such as collisions with other objects, current models fail to account for them and produce implausible results. We present VOID, a video object removal framework designed to perform physically plausible inpainting in these complex scenarios. To train the model, we use Kubric and HUMOTO to generate a new paired dataset of counterfactual object removals, in which removing an object requires altering downstream physical interactions. During inference, a vision-language model identifies the regions of the scene affected by the removed object. These regions then guide a video diffusion model that generates physically consistent counterfactual outcomes. Experiments on both synthetic and real data show that our approach preserves consistent scene dynamics after object removal better than prior video object removal methods. We hope this framework sheds light on how to make video editing models better simulators of the world through high-level causal reasoning.
