OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models

Abstract

Video insertion based on diffusion models has advanced considerably in recent work. However, existing methods rely on complex control signals yet still struggle with subject consistency, which limits their practical applicability. In this paper, we focus on the task of Mask-free Video Insertion and aim to resolve three key challenges: data scarcity, subject-scene equilibrium, and insertion harmonization. To address data scarcity, we propose a new data pipeline, InsertPipe, which automatically constructs diverse cross-pair data. Building on this pipeline, we develop OmniInsert, a novel unified framework for mask-free video insertion from both single and multiple subject references. Specifically, to maintain subject-scene equilibrium, we introduce a simple yet effective Condition-Specific Feature Injection mechanism that injects multi-source conditions distinctly, and we propose a Progressive Training strategy that enables the model to balance feature injection from the subjects and the source video. We also design a Subject-Focused Loss to improve the detailed appearance of the subjects. To further enhance insertion harmonization, we propose an Insertive Preference Optimization methodology that optimizes the model by simulating human preferences, and we incorporate a Context-Aware Rephraser module during inference to integrate the subject seamlessly into the original scene. To address the lack of a benchmark in the field, we introduce InsertBench, a comprehensive benchmark comprising diverse scenes with meticulously selected subjects. Evaluation on InsertBench indicates that OmniInsert outperforms state-of-the-art closed-source commercial solutions. The code will be released.
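As a rough illustration of the idea behind a subject-focused loss, the sketch below up-weights the reconstruction error inside a subject region. The abstract does not give the loss formula, so the function name `subject_focused_mse`, the binary `subject_mask`, and the weight `lam` are all assumptions made for illustration, not the authors' formulation.

```python
from typing import Sequence


def subject_focused_mse(
    pred: Sequence[float],
    target: Sequence[float],
    subject_mask: Sequence[int],
    lam: float = 1.0,
) -> float:
    """Hypothetical subject-weighted MSE (an assumption, not the paper's loss).

    Returns the mean squared error over all elements plus `lam` times the
    mean squared error restricted to elements where `subject_mask` is 1.
    """
    if not (len(pred) == len(target) == len(subject_mask)):
        raise ValueError("pred, target and subject_mask must have equal length")
    if len(pred) == 0:
        raise ValueError("inputs must be non-empty")

    # Element-wise squared error.
    sq_err = [(p - t) ** 2 for p, t in zip(pred, target)]

    # Global term: average error over every element.
    global_term = sum(sq_err) / len(sq_err)

    # Subject term: average error over masked elements only.
    n_subject = sum(subject_mask)
    if n_subject == 0:
        return global_term  # no subject region, fall back to plain MSE
    subject_term = sum(e * m for e, m in zip(sq_err, subject_mask)) / n_subject

    return global_term + lam * subject_term
```

With `lam = 0` this reduces to an ordinary MSE; larger values penalize errors in the subject region more heavily than errors in the background.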
