ROSE: Remove Objects with Side Effects in Videos

Abstract

Video object removal has reached strong performance thanks to the recent success of video generative models. However, existing methods struggle to eliminate the side effects of objects, such as their shadows and reflections, because paired video data for supervision is scarce. This paper presents ROSE (Remove Objects with Side Effects), a framework that systematically studies an object's effects on its environment and categorizes them into five common cases: shadows, reflections, light, translucency, and mirror. Because curating paired real videos that exhibit these effects is difficult, we use a 3D rendering engine to generate synthetic data. We construct a fully automatic data preparation pipeline that simulates a large-scale paired dataset with diverse scenes, objects, shooting angles, and camera trajectories. ROSE is implemented as a video inpainting model built on a diffusion transformer. To localize all object-correlated areas, the entire video is fed into the model for reference-based erasing. In addition, extra supervision is introduced to explicitly predict the areas affected by side effects, which are revealed by the difference mask between the paired videos. To evaluate model performance across the different kinds of side effect removal, we present a new benchmark, ROSE-Bench, which covers both common scenarios and the five special side effects. Experimental results show that ROSE outperforms existing video object erasing models and generalizes well to real-world videos. The project page is https://rose2025-inpaint.github.io/.
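As an illustration of how a difference mask between paired videos can expose side-effect regions, here is a minimal NumPy sketch. It is not the authors' implementation: the function name `difference_mask`, the per-channel maximum, and the threshold of 0.1 are all assumptions chosen for the example, since the abstract does not specify how the mask is computed.

```python
import numpy as np


def difference_mask(with_object: np.ndarray,
                    without_object: np.ndarray,
                    threshold: float = 0.1) -> np.ndarray:
    """Return a boolean mask of pixels that differ between paired videos.

    Both inputs are arrays of shape (T, H, W, C) with values in [0, 1].
    The threshold is a hypothetical value, not one taken from the paper.
    """
    if with_object.shape != without_object.shape:
        raise ValueError("paired videos must have the same shape")
    a = with_object.astype(np.float32)
    b = without_object.astype(np.float32)
    # Largest absolute per-channel difference at each pixel.
    diff = np.abs(a - b).max(axis=-1)
    return diff > threshold


# Toy pair: 2 frames of 4x4 RGB with a flat grey background.
background = np.full((2, 4, 4, 3), 0.5, dtype=np.float32)
composite = background.copy()
composite[:, 0:2, 0:2, :] = 1.0  # the "object"
composite[:, 2, 0:2, :] = 0.3    # its "shadow", darker than the background

mask = difference_mask(composite, background)
print(mask[0].astype(int))
```

In this toy pair the mask covers both the object pixels and the shadow pixels, which is the property the abstract relies on: a rendered pair differs wherever the object or any of its side effects changes the scene.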