EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing

Abstract

Video object removal aims to eliminate dynamic target objects and their visual effects, such as deformation, shadows, and reflections, while restoring seamless backgrounds. Recent diffusion-based video inpainting and object removal methods can remove the objects themselves but often struggle to erase these effects and to synthesize coherent backgrounds. Beyond method limitations, progress is further hampered by the lack of a comprehensive dataset that systematically captures common object effects across varied environments for training and evaluation. To address this, we introduce VOR (Video Object Removal), a large-scale dataset of diverse paired videos. Each pair consists of one video in which the target object is present with its effects and a counterpart in which the object and its effects are absent, along with the corresponding object masks. VOR contains 60K high-quality video pairs from captured and synthetic sources, covers five effect types, and spans a wide range of object categories as well as complex, dynamic multi-object scenes. Building on VOR, we propose EffectErase, an effect-aware video object removal method that treats video object insertion as the inverse auxiliary task within a reciprocal learning scheme. The model includes task-aware region guidance that focuses learning on affected areas and enables flexible task switching. It also uses an insertion-removal consistency objective that encourages complementary behaviors and shared localization of effect regions and structural cues. Trained on VOR, EffectErase achieves superior performance in extensive experiments, delivering high-quality video object effect erasing across diverse scenarios.

