OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation

Abstract

In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose. This task has significant practical value for automating content creation in real-world applications such as e-commerce demonstrations, short-video production, and interactive entertainment. However, existing approaches cannot accommodate all of these conditions at once. We present OmniShow, an end-to-end framework tailored to this practical yet challenging task, capable of harmonizing multimodal conditions and delivering industry-grade performance. To overcome the trade-off between controllability and quality, we introduce Unified Channel-wise Conditioning for efficient image and pose injection, and Gated Local-Context Attention to ensure precise audio-visual synchronization. To address data scarcity, we develop a Decoupled-Then-Joint Training strategy that uses a multi-stage training process with model merging to make efficient use of heterogeneous sub-task datasets. Furthermore, to fill the evaluation gap in this field, we establish HOIVG-Bench, a dedicated and comprehensive benchmark for HOIVG. Extensive experiments demonstrate that OmniShow achieves overall state-of-the-art performance across various multimodal conditioning settings, setting a solid standard for the emerging HOIVG task.
