AMD-Hummingbird: Towards an Efficient Text-to-Video Model
March 24, 2025
Authors: Takashi Isobe, He Cui, Dong Zhou, Mengmeng Ge, Dong Li, Emad Barsoum
cs.AI
Abstract
Text-to-Video (T2V) generation has attracted significant attention for its
ability to synthesize realistic videos from textual descriptions. However,
existing models struggle to balance computational efficiency and high visual
quality, particularly on resource-limited devices, e.g., iGPUs and mobile
phones. Most prior work prioritizes visual fidelity while overlooking the need
for smaller, more efficient models suitable for real-world deployment. To
address this challenge, we propose a lightweight T2V framework, termed
Hummingbird, which prunes existing models and enhances visual quality through
visual feedback learning. Our approach reduces the size of the U-Net from 1.4
billion to 0.7 billion parameters, significantly improving efficiency while
preserving high-quality video generation. Additionally, we introduce a novel
data processing pipeline that leverages Large Language Models (LLMs) and Video
Quality Assessment (VQA) models to enhance the quality of both text prompts and
video data. To support user-driven training and style customization, we
publicly release the full training code, including data processing and model
training. Extensive experiments show that our method achieves a 31X speedup
compared to state-of-the-art models such as VideoCrafter2, while also attaining
the highest overall score on VBench. Moreover, our method supports the
generation of videos with up to 26 frames, addressing the limitations of
existing U-Net-based methods in long video generation. Notably, the entire
training process requires only four GPUs, yet delivers performance competitive
with existing leading methods. Hummingbird presents a practical and efficient
solution for T2V generation, combining high performance, scalability, and
flexibility for real-world applications.
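The abstract does not specify how the U-Net was pruned from 1.4B to 0.7B parameters. As a rough, hypothetical illustration only (not the paper's actual method), the arithmetic below shows why uniform channel-width pruning need not halve the width to halve the parameter count: a convolution's weight count scales with the product of its input and output channels, so scaling every width by 1/sqrt(2) cuts parameters roughly in half. All channel numbers here are made up for the sketch.

```python
import math

# Hypothetical sketch: a k x k convolution has in_ch * out_ch * k * k
# weight parameters (bias ignored), so scaling all channel widths by a
# factor r scales the total parameter count by roughly r**2.

def conv_params(in_ch: int, out_ch: int, k: int = 3) -> int:
    """Weight-parameter count of a single k x k convolution."""
    return in_ch * out_ch * k * k

def prune_widths(channels: list[int], r: float) -> list[int]:
    """Scale every channel width by r (floored, at least 1 channel)."""
    return [max(1, int(c * r)) for c in channels]

def total_params(channels: list[int]) -> int:
    """Sum conv parameters over consecutive layer pairs in a stack."""
    return sum(conv_params(a, b) for a, b in zip(channels[:-1], channels[1:]))

# Toy channel plan for a small U-Net-like stack (illustrative numbers).
widths = [320, 640, 1280, 1280]

base = total_params(widths)
half = total_params(prune_widths(widths, 1 / math.sqrt(2)))

print(base, half, half / base)  # the ratio lands close to 0.5
```

The same scaling argument is one plausible reason a 2x parameter reduction can be achieved with only a modest (~30%) reduction in per-layer width; the paper's actual pruning criterion may differ.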