OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation
April 13, 2026
Authors: Donghao Zhou, Guisheng Liu, Hao Yang, Jiatong Li, Jingyu Lin, Xiaohu Huang, Yichen Liu, Xin Gao, Cunjian Chen, Shilei Wen, Chi-Wing Fu, Pheng-Ann Heng
cs.AI
Abstract
In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose. This task holds significant practical value for automating content creation in real-world applications, such as e-commerce demonstrations, short video production, and interactive entertainment. However, existing approaches fail to accommodate all these requisite conditions. We present OmniShow, an end-to-end framework tailored for this practical yet challenging task, capable of harmonizing multimodal conditions and delivering industry-grade performance. To overcome the trade-off between controllability and quality, we introduce Unified Channel-wise Conditioning for efficient image and pose injection, and Gated Local-Context Attention to ensure precise audio-visual synchronization. To effectively address data scarcity, we develop a Decoupled-Then-Joint Training strategy that leverages a multi-stage training process with model merging to efficiently harness heterogeneous sub-task datasets. Furthermore, to fill the evaluation gap in this field, we establish HOIVG-Bench, a dedicated and comprehensive benchmark for HOIVG. Extensive experiments demonstrate that OmniShow achieves overall state-of-the-art performance across various multimodal conditioning settings, setting a solid standard for the emerging HOIVG task.