

HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation

May 7, 2025
作者: Teng Hu, Zhentao Yu, Zhengguang Zhou, Sen Liang, Yuan Zhou, Qin Lin, Qinglin Lu
cs.AI

Abstract

Customized video generation aims to produce videos featuring specific subjects under flexible user-defined conditions, yet existing methods often struggle with identity consistency and limited input modalities. In this paper, we propose HunyuanCustom, a multi-modal customized video generation framework that emphasizes subject consistency while supporting image, audio, video, and text conditions. Built upon HunyuanVideo, our model first addresses the image-text conditioned generation task by introducing a text-image fusion module based on LLaVA for enhanced multi-modal understanding, along with an image ID enhancement module that leverages temporal concatenation to reinforce identity features across frames. To enable audio- and video-conditioned generation, we further propose modality-specific condition injection mechanisms: an AudioNet module that achieves hierarchical alignment via spatial cross-attention, and a video-driven injection module that integrates latent-compressed conditional video through a patchify-based feature-alignment network. Extensive experiments on single- and multi-subject scenarios demonstrate that HunyuanCustom significantly outperforms state-of-the-art open- and closed-source methods in terms of ID consistency, realism, and text-video alignment. Moreover, we validate its robustness across downstream tasks, including audio and video-driven customized video generation. Our results highlight the effectiveness of multi-modal conditioning and identity-preserving strategies in advancing controllable video generation. All the code and models are available at https://hunyuancustom.github.io.
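The abstract describes an AudioNet module that injects audio conditions via spatial cross-attention. As an illustrative sketch only (not the released implementation — all shapes, names, and the residual-injection detail are assumptions for exposition), the core idea of letting each frame's spatial video tokens attend to that frame's audio features can be written as:

```python
# Hedged sketch of per-frame spatial cross-attention for audio injection,
# in the spirit of the AudioNet module described above. Dimensions and
# function names are illustrative assumptions, not the paper's code.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def audio_cross_attention(video_tokens, audio_feats, Wq, Wk, Wv):
    """video_tokens: (T, N, d) spatial tokens per frame.
    audio_feats:  (T, M, d) audio features aligned to frames.
    Each frame's tokens query that frame's audio features, so the
    injection is spatially resolved and temporally aligned."""
    T, N, d = video_tokens.shape
    out = np.empty_like(video_tokens)
    for t in range(T):
        q = video_tokens[t] @ Wq                # (N, d) queries from video
        k = audio_feats[t] @ Wk                 # (M, d) keys from audio
        v = audio_feats[t] @ Wv                 # (M, d) values from audio
        attn = softmax(q @ k.T / np.sqrt(d))    # (N, M) attention weights
        out[t] = video_tokens[t] + attn @ v     # residual injection
    return out

# Toy usage: 4 frames, 16 spatial tokens, 8 audio features, width 32.
rng = np.random.default_rng(0)
d = 32
vid = rng.standard_normal((4, 16, d))
aud = rng.standard_normal((4, 8, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
fused = audio_cross_attention(vid, aud, Wq, Wk, Wv)
```

The per-frame loop makes the "hierarchical alignment" intuition concrete: audio conditions enter at the spatial-token level but only within their matching frame, rather than being broadcast across the whole clip.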

