OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models

Abstract

Recent advances in diffusion-based video insertion are impressive. However, existing methods rely on complex control signals yet still struggle with subject consistency, which limits their practical applicability. In this paper, we focus on the task of Mask-free Video Insertion and aim to resolve three key challenges: data scarcity, subject-scene equilibrium, and insertion harmonization. To address data scarcity, we propose a new data pipeline, InsertPipe, which automatically constructs diverse cross-pair data. Building upon this data pipeline, we develop OmniInsert, a novel unified framework for mask-free video insertion from both single and multiple subject references. Specifically, to maintain subject-scene equilibrium, we introduce a simple yet effective Condition-Specific Feature Injection mechanism that injects multi-source conditions distinctly, and we propose a novel Progressive Training strategy that enables the model to balance feature injection from the subjects and the source video. Meanwhile, we design the Subject-Focused Loss to improve the detailed appearance of the subjects. To further enhance insertion harmonization, we propose an Insertive Preference Optimization methodology that optimizes the model by simulating human preferences, and we incorporate a Context-Aware Rephraser module during inference to seamlessly integrate the subject into the original scenes. To address the lack of a benchmark for the field, we introduce InsertBench, a comprehensive benchmark comprising diverse scenes with meticulously selected subjects. Evaluation on InsertBench indicates that OmniInsert outperforms state-of-the-art closed-source commercial solutions. The code will be released.
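The abstract names a Subject-Focused Loss but does not define it. As a rough illustration only, the sketch below assumes one plausible form: the usual mean squared error over all positions, plus an extra mean squared error term restricted to the subject region. The function name `subject_focused_loss`, the `subject_mask` input, and the `subject_weight` hyperparameter are hypothetical and are not taken from the paper. The sketch also assumes that a subject mask is available in the training data, even though no mask is needed at inference time.

```python
from typing import Sequence


def subject_focused_loss(
    pred: Sequence[float],
    target: Sequence[float],
    subject_mask: Sequence[float],
    subject_weight: float = 2.0,
) -> float:
    """Hypothetical loss: global MSE plus an extra MSE term over subject positions.

    pred / target: flattened model output and training target (same length).
    subject_mask: 1.0 where the inserted subject is, 0.0 elsewhere
                  (assumed to exist at training time only).
    subject_weight: assumed hyperparameter, not a value from the paper.
    """
    if not (len(pred) == len(target) == len(subject_mask)):
        raise ValueError("pred, target and subject_mask must have the same length")
    if len(pred) == 0:
        raise ValueError("inputs must be non-empty")

    # Per-position squared error, shared by both terms.
    sq_err = [(p - t) ** 2 for p, t in zip(pred, target)]

    # Global term: averages over every position, scene and subject alike.
    base = sum(sq_err) / len(sq_err)

    # Subject term: averages only over masked positions, so small subjects
    # are not drowned out by the much larger background area.
    mask_area = sum(subject_mask)
    if mask_area == 0:
        return base  # no subject in this sample, fall back to the global term
    focused = sum(m * e for m, e in zip(subject_mask, sq_err)) / mask_area

    return base + subject_weight * focused
```

Under these assumptions, an error located on the subject is penalised more heavily than the same error on the background, which is one simple way a training objective could favour subject detail.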
