MTV-Inpaint: Multi-Task Long Video Inpainting

Abstract

Video inpainting modifies local regions within a video while ensuring spatial and temporal consistency. Most existing methods focus on scene completion (i.e., filling missing regions) and cannot insert new objects into a scene in a controllable manner. Fortunately, recent advances in text-to-video (T2V) diffusion models pave the way for text-guided video inpainting. However, directly adapting T2V models for inpainting remains limited in three respects: unifying completion and insertion tasks, input controllability, and handling long videos. These limitations restrict the applicability and flexibility of such models. To address these challenges, we propose MTV-Inpaint, a unified multi-task video inpainting framework capable of handling both traditional scene completion and novel object insertion tasks. To unify these distinct tasks, we design a dual-branch spatial attention mechanism in the T2V diffusion U-Net, enabling seamless integration of scene completion and object insertion within a single framework. In addition to textual guidance, MTV-Inpaint supports multimodal control by integrating various image inpainting models through our proposed image-to-video (I2V) inpainting mode. We also propose a two-stage pipeline that combines keyframe inpainting with in-between frame propagation, enabling MTV-Inpaint to effectively handle long videos with hundreds of frames. Extensive experiments demonstrate that MTV-Inpaint achieves state-of-the-art performance in both scene completion and object insertion tasks. It also extends to derived applications such as multimodal inpainting, object editing, object removal, and image object brush, and it handles long videos. Project page: https://mtv-inpaint.github.io/.
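
The two-stage pipeline described above (keyframe inpainting followed by in-between frame propagation) can be illustrated with a minimal scheduling sketch. The abstract does not say how keyframes are chosen, so the uniform stride used here is an assumption, and the function name `plan_two_stage` and its parameters are hypothetical. The sketch only plans which frames belong to each stage; it does not perform any inpainting.

```python
from typing import List, Tuple


def plan_two_stage(num_frames: int, stride: int) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Plan a hypothetical two-stage schedule for a long video.

    Stage 1 would inpaint the returned keyframes.
    Stage 2 would fill the frames strictly between each (start, end) pair,
    conditioned on the two already-inpainted keyframes at its ends.
    Uniform keyframe spacing is an assumption, not the paper's stated rule.
    """
    if num_frames < 1 or stride < 1:
        raise ValueError("num_frames and stride must be positive")

    # Uniformly spaced keyframes, always including the last frame.
    keyframes = list(range(0, num_frames, stride))
    if keyframes[-1] != num_frames - 1:
        keyframes.append(num_frames - 1)

    # Keep only pairs of consecutive keyframes that have frames between them.
    segments = [(a, b) for a, b in zip(keyframes, keyframes[1:]) if b - a > 1]
    return keyframes, segments


if __name__ == "__main__":
    keys, segs = plan_two_stage(num_frames=10, stride=4)
    print(keys)  # [0, 4, 8, 9]
    print(segs)  # [(0, 4), (4, 8)]
```

Under this assumed schedule, each in-between segment is bounded by two keyframes, so the second stage can process every segment with a fixed-length model regardless of the total video length.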