
MTV-Inpaint: Multi-Task Long Video Inpainting

March 14, 2025
Authors: Shiyuan Yang, Zheng Gu, Liang Hou, Xin Tao, Pengfei Wan, Xiaodong Chen, Jing Liao
cs.AI

Abstract

Video inpainting modifies local regions of a video while preserving spatial and temporal consistency. Most existing methods focus on scene completion (i.e., filling missing regions) and cannot insert new objects into a scene in a controllable manner. Recent advances in text-to-video (T2V) diffusion models pave the way for text-guided video inpainting. However, directly adapting T2V models for inpainting falls short in three respects: it does not unify completion and insertion tasks, it lacks input controllability, and it struggles with long videos, all of which restrict applicability and flexibility. To address these challenges, we propose MTV-Inpaint, a unified multi-task video inpainting framework that handles both traditional scene completion and novel object insertion. To unify these distinct tasks, we design a dual-branch spatial attention mechanism in the T2V diffusion U-Net, integrating scene completion and object insertion within a single framework. Beyond textual guidance, MTV-Inpaint supports multimodal control by integrating various image inpainting models through our proposed image-to-video (I2V) inpainting mode. We also propose a two-stage pipeline that combines keyframe inpainting with in-between frame propagation, enabling MTV-Inpaint to handle long videos with hundreds of frames. Extensive experiments show that MTV-Inpaint achieves state-of-the-art performance in both scene completion and object insertion. It is also versatile in derived applications such as multimodal inpainting, object editing, object removal, and image object brush, and it can handle long videos. Project page: https://mtv-inpaint.github.io/.
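
The two-stage pipeline mentioned in the abstract (keyframe inpainting followed by in-between frame propagation) can be pictured as a scheduling problem over frame indices. The sketch below is only an illustration of that idea, not the authors' implementation: the abstract does not say how keyframes are chosen, so the uniform stride, the function names (`select_keyframes`, `in_between_segments`, `plan_two_stage`), and the choice to always include the last frame are all assumptions. No diffusion model is invoked; the code only plans which frames each stage would cover.

```python
from typing import List, Tuple


def select_keyframes(num_frames: int, stride: int) -> List[int]:
    """Stage 1 plan (assumed): take every `stride`-th frame as a keyframe.

    The last frame is always included so that every other frame lies
    between two keyframes.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if stride <= 0:
        raise ValueError("stride must be positive")
    keyframes = list(range(0, num_frames, stride))
    if keyframes[-1] != num_frames - 1:
        keyframes.append(num_frames - 1)
    return keyframes


def in_between_segments(keyframes: List[int]) -> List[Tuple[int, int, List[int]]]:
    """Stage 2 plan (assumed): for each pair of consecutive keyframes,
    list the frame indices strictly between them."""
    segments = []
    for start, end in zip(keyframes, keyframes[1:]):
        segments.append((start, end, list(range(start + 1, end))))
    return segments


def plan_two_stage(num_frames: int, stride: int):
    """Return the keyframe indices and the in-between segments they bound."""
    keyframes = select_keyframes(num_frames, stride)
    return keyframes, in_between_segments(keyframes)


if __name__ == "__main__":
    keyframes, segments = plan_two_stage(num_frames=10, stride=4)
    print("keyframes:", keyframes)
    for start, end, middle in segments:
        print(f"propagate between {start} and {end}: {middle}")
```

In this framing, the keyframes would be inpainted first, and each segment would then be filled in conditioned on the two already-inpainted keyframes that bound it, which keeps the number of frames handled at once small regardless of the total video length.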
