

VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control

March 7, 2025
Authors: Yuxuan Bian, Zhaoyang Zhang, Xuan Ju, Mingdeng Cao, Liangbin Xie, Ying Shan, Qiang Xu
cs.AI

Abstract

Video inpainting, which aims to restore corrupted video content, has experienced substantial progress. Despite these advances, existing methods, whether propagating unmasked-region pixels through optical flow and receptive-field priors or extending image-inpainting models temporally, face challenges in generating fully masked objects or in balancing the competing objectives of background context preservation and foreground generation within a single model. To address these limitations, we propose a novel dual-stream paradigm, VideoPainter, that incorporates an efficient context encoder (comprising only 6% of the backbone parameters) to process masked videos and inject backbone-aware background contextual cues into any pre-trained video DiT, producing semantically consistent content in a plug-and-play manner. This architectural separation significantly reduces the model's learning complexity while enabling nuanced integration of crucial background context. We also introduce a novel target-region ID resampling technique that enables any-length video inpainting, greatly enhancing its practical applicability. Additionally, we establish a scalable dataset pipeline leveraging current vision understanding models, contributing VPData and VPBench, the largest video inpainting dataset and benchmark to date with over 390K diverse clips, to facilitate segmentation-based inpainting training and assessment. Using inpainting as a pipeline basis, we also explore downstream applications, including video editing and video-editing pair data generation, demonstrating competitive performance and significant practical potential. Extensive experiments demonstrate VideoPainter's superior performance in both any-length video inpainting and editing across eight key metrics, including video quality, mask-region preservation, and textual coherence.
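The abstract does not detail how the target-region ID resampling handles arbitrary video lengths, but any-length processing generally implies splitting the video into fixed-length, overlapping clips so that each clip fits the backbone's temporal window while shared frames keep adjacent clips consistent. The sketch below illustrates only that generic sliding-window decomposition; the function name, parameters, and overlap policy are assumptions for illustration, not VideoPainter's actual procedure.

```python
def split_into_clips(num_frames, clip_len, overlap):
    """Split an arbitrary-length video into overlapping fixed-length clips.

    Hypothetical helper sketching the sliding-window idea behind
    clip-wise any-length inpainting: each clip has clip_len frames,
    and consecutive clips share `overlap` frames so the model can
    condition on already-inpainted content for temporal consistency.
    Returns a list of (start, end) frame ranges covering all frames.
    """
    assert 0 < clip_len and 0 <= overlap < clip_len
    step = clip_len - overlap
    starts = []
    s = 0
    # Advance the window until the next clip would reach past the end.
    while s + clip_len < num_frames:
        starts.append(s)
        s += step
    # Final clip is anchored to the video's end so no frame is dropped.
    starts.append(max(num_frames - clip_len, 0))
    return [(s, min(s + clip_len, num_frames)) for s in starts]


# Example: a 100-frame video, 16-frame clips, 4 shared frames per pair.
clips = split_into_clips(100, 16, 4)
```

In this usage example, each range after the first starts 12 frames after its predecessor, and the last range is (84, 100), so the union of ranges covers every frame exactly once or within an overlap region.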

