OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models

Abstract

Recent advances in video insertion based on diffusion models are impressive. However, existing methods rely on complex control signals yet still struggle with subject consistency, which limits their practical applicability. In this paper, we focus on the task of Mask-free Video Insertion and aim to resolve three key challenges: data scarcity, subject-scene equilibrium, and insertion harmonization. To address data scarcity, we propose InsertPipe, a new data pipeline that automatically constructs diverse cross-pair data. Building upon this data pipeline, we develop OmniInsert, a novel unified framework for mask-free video insertion from both single and multiple subject references. Specifically, to maintain subject-scene equilibrium, we introduce a simple yet effective Condition-Specific Feature Injection mechanism that injects multi-source conditions distinctly, and we propose a novel Progressive Training strategy that enables the model to balance feature injection from the subjects and the source video. Meanwhile, we design the Subject-Focused Loss to improve the detailed appearance of the subjects. To further enhance insertion harmonization, we propose an Insertive Preference Optimization methodology that optimizes the model by simulating human preferences, and we incorporate a Context-Aware Rephraser module during inference to seamlessly integrate the subject into the original scene. To address the lack of a benchmark for the field, we introduce InsertBench, a comprehensive benchmark comprising diverse scenes with meticulously selected subjects. Evaluation on InsertBench indicates that OmniInsert outperforms state-of-the-art closed-source commercial solutions. The code will be released.
