OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models

Abstract

Recent advances in video insertion based on diffusion models are impressive. However, existing methods rely on complex control signals yet still struggle with subject consistency, which limits their practical applicability. In this paper, we focus on the task of Mask-free Video Insertion and aim to resolve three key challenges: data scarcity, subject-scene equilibrium, and insertion harmonization. To address data scarcity, we propose a new data pipeline, InsertPipe, which constructs diverse cross-pair data automatically. Building upon this data pipeline, we develop OmniInsert, a novel unified framework for mask-free video insertion from both single and multiple subject references. Specifically, to maintain subject-scene equilibrium, we introduce a simple yet effective Condition-Specific Feature Injection mechanism to distinctly inject multi-source conditions, and propose a novel Progressive Training strategy that enables the model to balance feature injection from the subjects and the source video. Meanwhile, we design the Subject-Focused Loss to improve the detailed appearance of the subjects. To further enhance insertion harmonization, we propose an Insertive Preference Optimization methodology that optimizes the model by simulating human preferences, and incorporate a Context-Aware Rephraser module during inference to seamlessly integrate the subject into the original scenes. To address the lack of a benchmark for the field, we introduce InsertBench, a comprehensive benchmark comprising diverse scenes with meticulously selected subjects. Evaluation on InsertBench indicates that OmniInsert outperforms state-of-the-art closed-source commercial solutions. The code will be released.

