AMD-Hummingbird: Towards an Efficient Text-to-Video Model
March 24, 2025
Authors: Takashi Isobe, He Cui, Dong Zhou, Mengmeng Ge, Dong Li, Emad Barsoum
cs.AI
Abstract
Text-to-Video (T2V) generation has attracted significant attention for its
ability to synthesize realistic videos from textual descriptions. However,
existing models struggle to balance computational efficiency and high visual
quality, particularly on resource-limited devices, e.g., iGPUs and mobile
phones. Most prior work prioritizes visual fidelity while overlooking the need
for smaller, more efficient models suitable for real-world deployment. To
address this challenge, we propose a lightweight T2V framework, termed
Hummingbird, which prunes existing models and enhances visual quality through
visual feedback learning. Our approach reduces the size of the U-Net from 1.4
billion to 0.7 billion parameters, significantly improving efficiency while
preserving high-quality video generation. Additionally, we introduce a novel
data processing pipeline that leverages Large Language Models (LLMs) and Video
Quality Assessment (VQA) models to enhance the quality of both text prompts and
video data. To support user-driven training and style customization, we
publicly release the full training code, including data processing and model
training. Extensive experiments show that our method achieves a 31X speedup
compared to state-of-the-art models such as VideoCrafter2, while also attaining
the highest overall score on VBench. Moreover, our method supports the
generation of videos with up to 26 frames, addressing the limitations of
existing U-Net-based methods in long video generation. Notably, the entire
training process requires only four GPUs, yet delivers performance competitive
with existing leading methods. Hummingbird presents a practical and efficient
solution for T2V generation, combining high performance, scalability, and
flexibility for real-world applications.
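The abstract does not specify how the U-Net was pruned from 1.4B to 0.7B parameters. As a rough, hypothetical illustration only (not the paper's actual method), the arithmetic below shows why uniform channel-width pruning need not halve the width to halve the parameter count: a convolution's weight count scales with the product of its input and output channels, so scaling every width by 1/sqrt(2) cuts parameters roughly in half. All channel numbers here are made up for the sketch.

```python
import math

# Hypothetical sketch: a k x k convolution has in_ch * out_ch * k * k
# weight parameters (bias ignored), so scaling all channel widths by a
# factor r scales the total parameter count by roughly r**2.

def conv_params(in_ch: int, out_ch: int, k: int = 3) -> int:
    """Weight-parameter count of a single k x k convolution."""
    return in_ch * out_ch * k * k

def prune_widths(channels: list[int], r: float) -> list[int]:
    """Scale every channel width by r (floored, at least 1 channel)."""
    return [max(1, int(c * r)) for c in channels]

def total_params(channels: list[int]) -> int:
    """Sum conv parameters over consecutive layer pairs in a stack."""
    return sum(conv_params(a, b) for a, b in zip(channels[:-1], channels[1:]))

# Toy channel plan for a small U-Net-like stack (illustrative numbers).
widths = [320, 640, 1280, 1280]

base = total_params(widths)
half = total_params(prune_widths(widths, 1 / math.sqrt(2)))

print(base, half, half / base)  # the ratio lands close to 0.5
```

The same scaling argument is one plausible reason a 2x parameter reduction can be achieved with only a modest (~30%) reduction in per-layer width; the paper's actual pruning criterion may differ.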