OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation

Abstract

In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose. This task holds significant practical value for automating content creation in real-world applications, such as e-commerce demonstrations, short video production, and interactive entertainment. However, existing approaches fail to accommodate all these requisite conditions. We present OmniShow, an end-to-end framework tailored for this practical yet challenging task, capable of harmonizing multimodal conditions and delivering industry-grade performance. To overcome the trade-off between controllability and quality, we introduce Unified Channel-wise Conditioning for efficient image and pose injection, and Gated Local-Context Attention to ensure precise audio-visual synchronization. To effectively address data scarcity, we develop a Decoupled-Then-Joint Training strategy that leverages a multi-stage training process with model merging to efficiently harness heterogeneous sub-task datasets. Furthermore, to fill the evaluation gap in this field, we establish HOIVG-Bench, a dedicated and comprehensive benchmark for HOIVG. Extensive experiments demonstrate that OmniShow achieves overall state-of-the-art performance across various multimodal conditioning settings, setting a solid standard for the emerging HOIVG task.
