OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models

Abstract

Recent advances in video insertion based on diffusion models are impressive. However, existing methods rely on complex control signals yet still struggle with subject consistency, which limits their practical applicability. In this paper, we focus on the task of Mask-free Video Insertion and aim to resolve three key challenges: data scarcity, subject-scene equilibrium, and insertion harmonization. To address data scarcity, we propose a new data pipeline, InsertPipe, which constructs diverse cross-pair data automatically. Building upon this data pipeline, we develop OmniInsert, a novel unified framework for mask-free video insertion from both single and multiple subject references. Specifically, to maintain subject-scene equilibrium, we introduce a simple yet effective Condition-Specific Feature Injection mechanism that injects multi-source conditions distinctly, and we propose a novel Progressive Training strategy that enables the model to balance feature injection from the subjects and the source video. Meanwhile, we design the Subject-Focused Loss to improve the detailed appearance of the subjects. To further enhance insertion harmonization, we propose an Insertive Preference Optimization methodology that optimizes the model by simulating human preferences, and we incorporate a Context-Aware Rephraser module during inference to integrate the subject seamlessly into the original scenes. To address the lack of a benchmark for the field, we introduce InsertBench, a comprehensive benchmark comprising diverse scenes with meticulously selected subjects. Evaluation on InsertBench indicates that OmniInsert outperforms state-of-the-art closed-source commercial solutions. The code will be released.
