OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation

Abstract

In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose. This task holds significant practical value for automating content creation in real-world applications such as e-commerce demonstrations, short video production, and interactive entertainment. However, existing approaches fail to accommodate all of these conditions. We present OmniShow, an end-to-end framework tailored for this practical yet challenging task, capable of harmonizing multimodal conditions and delivering industry-grade performance. To overcome the trade-off between controllability and quality, we introduce Unified Channel-wise Conditioning for efficient image and pose injection, and Gated Local-Context Attention to ensure precise audio-visual synchronization. To address data scarcity, we develop a Decoupled-Then-Joint Training strategy that combines a multi-stage training process with model merging to efficiently harness heterogeneous sub-task datasets. Furthermore, to fill the evaluation gap in this field, we establish HOIVG-Bench, a dedicated and comprehensive benchmark for HOIVG. Extensive experiments demonstrate that OmniShow achieves overall state-of-the-art performance across various multimodal conditioning settings, setting a solid standard for the emerging HOIVG task.
