OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation

April 13, 2026
Authors: Donghao Zhou, Guisheng Liu, Hao Yang, Jiatong Li, Jingyu Lin, Xiaohu Huang, Yichen Liu, Xin Gao, Cunjian Chen, Shilei Wen, Chi-Wing Fu, Pheng-Ann Heng
cs.AI

Abstract

In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose. This task holds significant practical value for automating content creation in real-world applications, such as e-commerce demonstrations, short video production, and interactive entertainment. However, existing approaches fail to accommodate all these requisite conditions. We present OmniShow, an end-to-end framework tailored for this practical yet challenging task, capable of harmonizing multimodal conditions and delivering industry-grade performance. To overcome the trade-off between controllability and quality, we introduce Unified Channel-wise Conditioning for efficient image and pose injection, and Gated Local-Context Attention to ensure precise audio-visual synchronization. To effectively address data scarcity, we develop a Decoupled-Then-Joint Training strategy that leverages a multi-stage training process with model merging to efficiently harness heterogeneous sub-task datasets. Furthermore, to fill the evaluation gap in this field, we establish HOIVG-Bench, a dedicated and comprehensive benchmark for HOIVG. Extensive experiments demonstrate that OmniShow achieves overall state-of-the-art performance across various multimodal conditioning settings, setting a solid standard for the emerging HOIVG task.
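The Unified Channel-wise Conditioning described above can be sketched in miniature: condition latents (a reference image broadcast over time, plus a per-frame pose latent) are concatenated with the video latent along the channel axis, and a channel-mixing projection restores the backbone's expected width. This is a minimal illustration under assumed shapes; the function and tensor names are hypothetical, not the paper's actual implementation (which the abstract does not specify).

```python
import numpy as np

def channelwise_inject(video_latent, image_latent, pose_latent, proj):
    """Hedged sketch: inject image/pose conditions via channel concatenation.

    video_latent: (T, C, H, W) noisy video latent
    image_latent: (C, H, W) reference-image latent, broadcast across frames
    pose_latent:  (T, C, H, W) per-frame pose latent
    proj:         (C_out, 3*C) channel-mixing matrix (a 1x1 conv, in effect)
    """
    T = video_latent.shape[0]
    image_seq = np.repeat(image_latent[None], T, axis=0)          # (T, C, H, W)
    x = np.concatenate([video_latent, image_seq, pose_latent], axis=1)  # (T, 3C, H, W)
    # Mix the widened channel dim back down: (o,c) x (t,c,h,w) -> (t,o,h,w)
    return np.einsum("oc,tchw->tohw", proj, x)

# Toy shapes for illustration only.
T, C, H, W = 4, 8, 16, 16
rng = np.random.default_rng(0)
video = rng.standard_normal((T, C, H, W))
img = rng.standard_normal((C, H, W))
pose = rng.standard_normal((T, C, H, W))
proj = rng.standard_normal((C, 3 * C)) * 0.01
out = channelwise_inject(video, img, pose, proj)
print(out.shape)  # (4, 8, 16, 16)
```

The appeal of channel-wise injection, as opposed to extra cross-attention layers, is that it adds almost no parameters or compute beyond the widened first projection, which matches the abstract's emphasis on efficiency.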