OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation

Abstract

In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose. This task holds significant practical value for automating content creation in real-world applications such as e-commerce demonstrations, short video production, and interactive entertainment. However, existing approaches cannot accommodate all of these conditions simultaneously. We present OmniShow, an end-to-end framework tailored for this practical yet challenging task, capable of harmonizing multimodal conditions and delivering industry-grade performance. To overcome the trade-off between controllability and quality, we introduce Unified Channel-wise Conditioning for efficient image and pose injection, and Gated Local-Context Attention to ensure precise audio-visual synchronization. To address data scarcity, we develop a Decoupled-Then-Joint Training strategy that uses multi-stage training with model merging to efficiently harness heterogeneous sub-task datasets. Furthermore, to fill the evaluation gap in this field, we establish HOIVG-Bench, a dedicated and comprehensive benchmark for HOIVG. Extensive experiments demonstrate that OmniShow achieves overall state-of-the-art performance across various multimodal conditioning settings, setting a solid standard for the emerging HOIVG task.