AMD-Hummingbird: Towards an Efficient Text-to-Video Model

Abstract

Text-to-Video (T2V) generation has attracted significant attention for its ability to synthesize realistic videos from textual descriptions. However, existing models struggle to balance computational efficiency with high visual quality, particularly on resource-limited devices such as iGPUs and mobile phones. Most prior work prioritizes visual fidelity while overlooking the need for smaller, more efficient models suitable for real-world deployment. To address this challenge, we propose a lightweight T2V framework, termed Hummingbird, which prunes existing models and enhances visual quality through visual feedback learning. Our approach reduces the size of the U-Net from 1.4 billion to 0.7 billion parameters, significantly improving efficiency while preserving high-quality video generation. Additionally, we introduce a novel data processing pipeline that leverages Large Language Models (LLMs) and Video Quality Assessment (VQA) models to enhance the quality of both text prompts and video data. To support user-driven training and style customization, we publicly release the full training code, including data processing and model training. Extensive experiments show that our method achieves a 31× speedup compared to state-of-the-art models such as VideoCrafter2, while also attaining the highest overall score on VBench. Moreover, our method supports the generation of videos with up to 26 frames, addressing the limitations of existing U-Net-based methods in long video generation. Notably, the entire training process requires only four GPUs, yet delivers performance competitive with existing leading methods. Hummingbird presents a practical and efficient solution for T2V generation, combining high performance, scalability, and flexibility for real-world applications.
