HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation

Abstract

Customized video generation aims to produce videos featuring specific subjects under flexible user-defined conditions, yet existing methods often struggle with identity consistency and limited input modalities. In this paper, we propose HunyuanCustom, a multi-modal customized video generation framework that emphasizes subject consistency while supporting image, audio, video, and text conditions. Built upon HunyuanVideo, our model first addresses the image-text conditioned generation task by introducing a text-image fusion module based on LLaVA for enhanced multi-modal understanding, along with an image ID enhancement module that leverages temporal concatenation to reinforce identity features across frames. To enable audio- and video-conditioned generation, we further propose modality-specific condition injection mechanisms: an AudioNet module that achieves hierarchical alignment via spatial cross-attention, and a video-driven injection module that integrates latent-compressed conditional video through a patchify-based feature-alignment network. Extensive experiments on single- and multi-subject scenarios demonstrate that HunyuanCustom significantly outperforms state-of-the-art open- and closed-source methods in terms of ID consistency, realism, and text-video alignment. Moreover, we validate its robustness across downstream tasks, including audio- and video-driven customized video generation. Our results highlight the effectiveness of multi-modal conditioning and identity-preserving strategies in advancing controllable video generation. All code and models are available at https://hunyuancustom.github.io.
