HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning

Abstract

Human-Centric Video Generation (HCVG) methods aim to synthesize human videos from multimodal inputs, including text, images, and audio. Existing methods struggle to coordinate these heterogeneous modalities effectively because of two challenges: the scarcity of training data with paired triplet conditions, and the difficulty of making the sub-tasks of subject preservation and audio-visual sync work together under multimodal inputs. In this work, we present HuMo, a unified HCVG framework for collaborative multimodal control. To address the first challenge, we construct a high-quality dataset with diverse, paired text, reference images, and audio. To address the second, we propose a two-stage progressive multimodal training paradigm with task-specific strategies. For the subject preservation task, we adopt a minimally invasive image injection strategy to maintain the prompt-following and visual generation abilities of the foundation model. For the audio-visual sync task, in addition to the commonly adopted audio cross-attention layer, we propose a focus-by-predicting strategy that implicitly guides the model to associate audio with facial regions. To learn controllability jointly across the multimodal inputs, we progressively incorporate the audio-visual sync task on top of previously acquired capabilities. During inference, for flexible and fine-grained multimodal control, we design a time-adaptive Classifier-Free Guidance strategy that dynamically adjusts guidance weights across denoising steps. Extensive experimental results demonstrate that HuMo surpasses specialized state-of-the-art methods on the sub-tasks, establishing a unified framework for collaborative multimodal-conditioned HCVG. Project Page: https://phantom-video.github.io/HuMo.
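
The abstract does not give the form of the time-adaptive Classifier-Free Guidance, so the following is only a minimal sketch of the general idea: per-condition guidance weights that depend on the denoising step, combined with an additive multi-condition guidance rule. The function names, the linear schedule, the numeric weights, and the additive combination are all assumptions made for illustration and should not be read as HuMo's actual formulation.

```python
from typing import Dict, List

Vector = List[float]


def guidance_weights(step: int, num_steps: int) -> Dict[str, float]:
    """Hypothetical schedule: shift weight linearly from text to audio.

    step 0 is the first (noisiest) denoising step. Every numeric value
    here is an illustrative placeholder, not a value from the paper.
    """
    if num_steps < 1 or not 0 <= step < num_steps:
        raise ValueError("step must satisfy 0 <= step < num_steps")
    progress = step / (num_steps - 1) if num_steps > 1 else 0.0
    return {
        "text": 6.0 - 3.0 * progress,   # decays from 6.0 to 3.0
        "image": 3.0,                   # held constant
        "audio": 2.0 + 3.0 * progress,  # grows from 2.0 to 5.0
    }


def combine_predictions(
    uncond: Vector,
    cond: Dict[str, Vector],
    weights: Dict[str, float],
) -> Vector:
    """Additive multi-condition guidance (an assumed combination rule).

    Computes uncond + sum_c w_c * (cond_c - uncond) element-wise, where
    cond_c is the model output given only condition c.
    """
    guided = list(uncond)
    for name, pred in cond.items():
        if len(pred) != len(uncond):
            raise ValueError(f"shape mismatch for condition {name!r}")
        w = weights[name]
        for i, (p, u) in enumerate(zip(pred, uncond)):
            guided[i] += w * (p - u)
    return guided
```

In a sampling loop, one would call guidance_weights(step, num_steps) at each denoising step and pass the result to combine_predictions together with that step's conditional and unconditional model outputs. A step-dependent schedule like this lets different modalities dominate at different stages of denoising; which modality should dominate when is a design choice that this sketch does not settle.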
