AMD-Hummingbird: Towards an Efficient Text-to-Video Model
March 24, 2025
Authors: Takashi Isobe, He Cui, Dong Zhou, Mengmeng Ge, Dong Li, Emad Barsoum
cs.AI
Abstract
Text-to-Video (T2V) generation has attracted significant attention for its
ability to synthesize realistic videos from textual descriptions. However,
existing models struggle to balance computational efficiency and high visual
quality, particularly on resource-limited devices, e.g., iGPUs and mobile
phones. Most prior work prioritizes visual fidelity while overlooking the need
for smaller, more efficient models suitable for real-world deployment. To
address this challenge, we propose a lightweight T2V framework, termed
Hummingbird, which prunes existing models and enhances visual quality through
visual feedback learning. Our approach reduces the size of the U-Net from 1.4
billion to 0.7 billion parameters, significantly improving efficiency while
preserving high-quality video generation. Additionally, we introduce a novel
data processing pipeline that leverages Large Language Models (LLMs) and Video
Quality Assessment (VQA) models to enhance the quality of both text prompts and
video data. To support user-driven training and style customization, we
publicly release the full training code, including data processing and model
training. Extensive experiments show that our method achieves a 31X speedup
compared to state-of-the-art models such as VideoCrafter2, while also attaining
the highest overall score on VBench. Moreover, our method supports the
generation of videos with up to 26 frames, addressing the limitations of
existing U-Net-based methods in long video generation. Notably, the entire
training process requires only four GPUs, yet delivers performance competitive
with existing leading methods. Hummingbird presents a practical and efficient
solution for T2V generation, combining high performance, scalability, and
flexibility for real-world applications.
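The abstract's headline compression, halving the U-Net from 1.4 billion to 0.7 billion parameters, can be illustrated with a toy parameter count. The sketch below is purely illustrative and not the authors' method: the channel widths and the width-pruning scheme are assumptions chosen only to show how thinning channels roughly halves the parameter budget of a stack of 3x3 convolutions.

```python
def conv_params(c_in, c_out, k=3):
    """Parameter count of a k x k conv layer (weights + bias)."""
    return c_in * c_out * k * k + c_out

def total_params(widths):
    """Sum conv parameters over consecutive layer widths."""
    return sum(conv_params(a, b) for a, b in zip(widths, widths[1:]))

# Hypothetical channel widths for a toy U-Net encoder path
# (assumed values, not the Hummingbird architecture).
full = [320, 640, 1280, 1280]
# Width pruning: shrink channels so the total is roughly halved.
pruned = [320, 448, 896, 896]

ratio = total_params(pruned) / total_params(full)
print(total_params(full), total_params(pruned), round(ratio, 3))
```

Because a conv layer's cost scales with the product of its input and output channels, cutting widths by only about 30% is enough to halve the overall count, which is the same kind of trade the paper describes at 1.4B to 0.7B scale.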