HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning

Abstract

Human-Centric Video Generation (HCVG) methods seek to synthesize human videos from multimodal inputs, including text, images, and audio. Existing methods struggle to coordinate these heterogeneous modalities because of two challenges: the scarcity of training data with paired triplet conditions, and the difficulty of coordinating the sub-tasks of subject preservation and audio-visual sync under multimodal inputs. In this work, we present HuMo, a unified HCVG framework for collaborative multimodal control. For the first challenge, we construct a high-quality dataset with diverse, paired text, reference images, and audio. For the second challenge, we propose a two-stage progressive multimodal training paradigm with task-specific strategies. For the subject preservation task, we adopt a minimally invasive image injection strategy so that the foundation model retains its prompt-following and visual generation abilities. For the audio-visual sync task, in addition to the commonly adopted audio cross-attention layer, we propose a focus-by-predicting strategy that implicitly guides the model to associate audio with facial regions. To jointly learn controllability across multimodal inputs, we progressively incorporate the audio-visual sync task on top of the previously acquired capabilities. During inference, for flexible and fine-grained multimodal control, we design a time-adaptive Classifier-Free Guidance strategy that dynamically adjusts guidance weights across denoising steps. Extensive experimental results demonstrate that HuMo surpasses specialized state-of-the-art methods on the sub-tasks, establishing a unified framework for collaborative multimodal-conditioned HCVG. Project page: https://phantom-video.github.io/HuMo
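
The abstract does not give the guidance formula or the schedule, so the following is only a minimal sketch of what a time-adaptive Classifier-Free Guidance step might look like. It assumes a nested decomposition over the text, image, and audio conditions and a simple two-phase schedule. The names GuidanceWeights, guidance_schedule, and combine_predictions, the weight values, and the switch point are all hypothetical and are not taken from HuMo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuidanceWeights:
    """Per-modality guidance scales (hypothetical container, not from the paper)."""
    text: float
    image: float
    audio: float


def guidance_schedule(
    step: int,
    num_steps: int,
    early: GuidanceWeights = GuidanceWeights(text=7.5, image=2.0, audio=2.0),
    late: GuidanceWeights = GuidanceWeights(text=4.0, image=3.0, audio=5.5),
    switch_fraction: float = 0.5,
) -> GuidanceWeights:
    """Return the weights for a denoising step.

    Assumed schedule: one set of weights for the first part of sampling,
    another for the rest. The values and the switch point are illustrative.
    """
    if not 0 <= step < num_steps:
        raise ValueError("step must satisfy 0 <= step < num_steps")
    return early if step < switch_fraction * num_steps else late


def combine_predictions(uncond, text, text_image, text_image_audio, weights):
    """Combine four noise predictions with nested classifier-free guidance.

    Each argument is a flat sequence of floats standing in for a model output.
    Assumed form: start from the unconditional prediction and add one scaled
    difference per condition that is switched on.
    """
    return [
        u
        + weights.text * (t - u)
        + weights.image * (ti - t)
        + weights.audio * (tia - ti)
        for u, t, ti, tia in zip(uncond, text, text_image, text_image_audio)
    ]
```

With all three weights set to 1, the combination reduces to the fully conditioned prediction, which is a convenient sanity check. In a sampler, guidance_schedule would be called once per denoising step and its result passed to combine_predictions.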
