MTV-Inpaint: Multi-Task Long Video Inpainting

Abstract

Video inpainting modifies local regions within a video while preserving spatial and temporal consistency. Most existing methods focus primarily on scene completion (i.e., filling missing regions) and lack the capability to insert new objects into a scene in a controllable manner. Fortunately, recent advances in text-to-video (T2V) diffusion models pave the way for text-guided video inpainting. However, directly adapting T2V models for inpainting falls short in three respects: it does not unify completion and insertion tasks, it offers limited input controllability, and it struggles with long videos, all of which restrict applicability and flexibility. To address these challenges, we propose MTV-Inpaint, a unified multi-task video inpainting framework capable of handling both traditional scene completion and novel object insertion tasks. To unify these distinct tasks, we design a dual-branch spatial attention mechanism in the T2V diffusion U-Net, enabling both to be handled within a single framework. In addition to textual guidance, MTV-Inpaint supports multimodal control by integrating various image inpainting models through our proposed image-to-video (I2V) inpainting mode. We also propose a two-stage pipeline that combines keyframe inpainting with in-between frame propagation, enabling MTV-Inpaint to effectively handle long videos with hundreds of frames. Extensive experiments demonstrate that MTV-Inpaint achieves state-of-the-art performance in both scene completion and object insertion tasks. Furthermore, it demonstrates versatility in derived applications such as multimodal inpainting, object editing, removal, image object brush, and long-video processing. Project page: https://mtv-inpaint.github.io/.
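
To make the two-stage idea concrete, the following is a minimal sketch of how a long video might be split into keyframes (stage one) and in-between segments bounded by neighbouring keyframes (stage two). It illustrates only the scheduling, not the inpainting itself. The helper name `plan_two_stage`, the `stride` parameter, and the uniform keyframe spacing are assumptions made for illustration; the abstract does not specify how keyframes are chosen.

```python
from typing import List, Tuple


def plan_two_stage(num_frames: int, stride: int) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Split a video into keyframe indices and in-between segments.

    Hypothetical helper: uniform spacing and `stride` are illustrative
    assumptions, not values taken from the paper.
    """
    if num_frames < 1 or stride < 1:
        raise ValueError("num_frames and stride must be positive")

    # Stage 1: uniformly spaced keyframes, always including the last frame
    # so that every other frame lies between two keyframes.
    keyframes = list(range(0, num_frames, stride))
    if keyframes[-1] != num_frames - 1:
        keyframes.append(num_frames - 1)

    # Stage 2: each pair of neighbouring keyframes bounds one segment whose
    # interior frames would be filled in by propagation.
    segments = list(zip(keyframes[:-1], keyframes[1:]))
    return keyframes, segments


if __name__ == "__main__":
    keyframes, segments = plan_two_stage(num_frames=10, stride=4)
    print(keyframes)  # [0, 4, 8, 9]
    print(segments)   # [(0, 4), (4, 8), (8, 9)]
```

Under this assumed schedule, each segment can be processed independently once its two bounding keyframes are inpainted, which is one plausible way a model with a short temporal window could cover hundreds of frames.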
