HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation

May 7, 2025
Authors: Teng Hu, Zhentao Yu, Zhengguang Zhou, Sen Liang, Yuan Zhou, Qin Lin, Qinglin Lu
cs.AI

Abstract

Customized video generation aims to produce videos featuring specific subjects under flexible user-defined conditions, yet existing methods often struggle with identity consistency and limited input modalities. In this paper, we propose HunyuanCustom, a multi-modal customized video generation framework that emphasizes subject consistency while supporting image, audio, video, and text conditions. Built upon HunyuanVideo, our model first addresses the image-text conditioned generation task by introducing a text-image fusion module based on LLaVA for enhanced multi-modal understanding, along with an image ID enhancement module that leverages temporal concatenation to reinforce identity features across frames. To enable audio- and video-conditioned generation, we further propose modality-specific condition injection mechanisms: an AudioNet module that achieves hierarchical alignment via spatial cross-attention, and a video-driven injection module that integrates latent-compressed conditional video through a patchify-based feature-alignment network. Extensive experiments on single- and multi-subject scenarios demonstrate that HunyuanCustom significantly outperforms state-of-the-art open- and closed-source methods in terms of ID consistency, realism, and text-video alignment. Moreover, we validate its robustness across downstream tasks, including audio and video-driven customized video generation. Our results highlight the effectiveness of multi-modal conditioning and identity-preserving strategies in advancing controllable video generation. All the code and models are available at https://hunyuancustom.github.io.
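The abstract names two concrete conditioning mechanisms: an image ID enhancement module that reinforces identity via temporal concatenation, and an AudioNet module that injects audio features through spatial cross-attention. The snippet below is a minimal PyTorch sketch of how such mechanisms could look, written under our own assumptions; it is not the released HunyuanCustom implementation, and all class, argument, and tensor names (`IdentityTemporalConcat`, `AudioCrossAttention`, `dim`, `audio_dim`) are hypothetical illustrations.

```python
# Minimal sketch (NOT the official HunyuanCustom code) of the two
# conditioning ideas described in the abstract. All names are hypothetical.

import torch
import torch.nn as nn


class IdentityTemporalConcat(nn.Module):
    """Reinforce subject identity by prepending the reference-image latent
    as an extra "frame" along the temporal axis, so every video token can
    attend to the identity tokens inside the video transformer."""

    def forward(self, video_latent: torch.Tensor, id_latent: torch.Tensor) -> torch.Tensor:
        # video_latent: (B, T, C, H, W); id_latent: (B, C, H, W)
        return torch.cat([id_latent.unsqueeze(1), video_latent], dim=1)  # (B, T+1, C, H, W)


class AudioCrossAttention(nn.Module):
    """Inject per-frame audio features into spatial video tokens via
    cross-attention: queries come from video tokens, keys/values from audio."""

    def __init__(self, dim: int, audio_dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(
            dim, heads, kdim=audio_dim, vdim=audio_dim, batch_first=True
        )

    def forward(self, video_tokens: torch.Tensor, audio_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B*T, H*W, dim) spatial tokens of one frame each
        # audio_tokens: (B*T, S, audio_dim) audio features aligned per frame
        out, _ = self.attn(self.norm(video_tokens), audio_tokens, audio_tokens)
        return video_tokens + out  # residual injection keeps the backbone intact


if __name__ == "__main__":
    concat = IdentityTemporalConcat()
    x = concat(torch.randn(2, 16, 4, 32, 32), torch.randn(2, 4, 32, 32))
    print(x.shape)  # torch.Size([2, 17, 4, 32, 32])

    xattn = AudioCrossAttention(dim=64, audio_dim=128)
    v = xattn(torch.randn(8, 256, 64), torch.randn(8, 10, 128))
    print(v.shape)  # torch.Size([8, 256, 64])
```

The residual form of the audio injection is a common design choice for adding a new modality to a pretrained backbone: when the cross-attention output is near zero, the model falls back to its original text/image-conditioned behavior.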