OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models
September 22, 2025
Authors: Jinshu Chen, Xinghui Li, Xu Bai, Tianxiang Ma, Pengze Zhang, Zhuowei Chen, Gen Li, Lijie Liu, Songtao Zhao, Bingchuan Li, Qian He
cs.AI
Abstract
Recent advances in video insertion based on diffusion models are impressive.
However, existing methods rely on complex control signals but struggle with
subject consistency, limiting their practical applicability. In this paper, we
focus on the task of Mask-free Video Insertion and aim to resolve three key
challenges: data scarcity, subject-scene equilibrium, and insertion
harmonization. To address data scarcity, we propose a new data pipeline,
InsertPipe, that automatically constructs diverse cross-pair data. Building upon
our data pipeline, we develop OmniInsert, a novel unified framework for
mask-free video insertion from both single and multiple subject references.
Specifically, to maintain subject-scene equilibrium, we introduce a simple yet
effective Condition-Specific Feature Injection mechanism to distinctly inject
multi-source conditions and propose a novel Progressive Training strategy that
enables the model to balance feature injection from subjects and source video.
Meanwhile, we design the Subject-Focused Loss to improve the detailed
appearance of the subjects. To further enhance insertion harmonization, we
propose an Insertive Preference Optimization methodology to optimize the model
by simulating human preferences, and incorporate a Context-Aware Rephraser
module during inference to seamlessly integrate the subject into the original
scenes. To address the lack of a benchmark for the field, we introduce
InsertBench, a comprehensive benchmark comprising diverse scenes with
meticulously selected subjects. Evaluation on InsertBench indicates that OmniInsert
outperforms state-of-the-art closed-source commercial solutions. The code will
be released.
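
The abstract does not define the Subject-Focused Loss, so the following is only a minimal sketch of one plausible reading: a standard reconstruction error over all positions plus an extra term restricted to the subject region. The function name `subject_focused_loss`, the `subject_mask` input (assumed here to be available for training pairs only, since inference is mask-free), and the `subject_weight` parameter are all hypothetical and not taken from the paper.

```python
from typing import Sequence


def subject_focused_loss(
    pred: Sequence[float],
    target: Sequence[float],
    subject_mask: Sequence[float],
    subject_weight: float = 1.0,
) -> float:
    """Global MSE plus an extra MSE term averaged over the subject region.

    Illustrative assumption only: the paper's actual formulation may differ.
    `subject_mask` holds 1.0 where a position belongs to the inserted subject
    and 0.0 elsewhere (flattened to the same length as `pred` and `target`).
    """
    if not (len(pred) == len(target) == len(subject_mask)):
        raise ValueError("pred, target and subject_mask must have equal length")
    if len(pred) == 0:
        raise ValueError("inputs must be non-empty")

    # Per-position squared error between prediction and target.
    sq_err = [(p - t) ** 2 for p, t in zip(pred, target)]

    # Base term: plain mean squared error over every position.
    base = sum(sq_err) / len(sq_err)

    # Focused term: squared error averaged over subject positions only.
    mask_sum = sum(subject_mask)
    if mask_sum == 0:
        # No subject positions in this sample, so only the base term applies.
        return base
    focused = sum(e * m for e, m in zip(sq_err, subject_mask)) / mask_sum

    return base + subject_weight * focused
```

Under this sketch, an error that falls inside the subject region is penalized by both terms, while the same error in the background contributes only to the base term, which is one simple way to bias training toward subject detail.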