AMD-Hummingbird: Towards an Efficient Text-to-Video Model

Abstract

Text-to-Video (T2V) generation has attracted significant attention for its ability to synthesize realistic videos from textual descriptions. However, existing models struggle to balance computational efficiency with high visual quality, particularly on resource-limited devices such as iGPUs and mobile phones. Most prior work prioritizes visual fidelity while overlooking the need for smaller, more efficient models suitable for real-world deployment. To address this challenge, we propose a lightweight T2V framework, termed Hummingbird, which prunes existing models and enhances visual quality through visual feedback learning. Our approach reduces the size of the U-Net from 1.4 billion to 0.7 billion parameters, significantly improving efficiency while preserving high-quality video generation. Additionally, we introduce a novel data processing pipeline that leverages Large Language Models (LLMs) and Video Quality Assessment (VQA) models to enhance the quality of both text prompts and video data. To support user-driven training and style customization, we publicly release the full training code, including data processing and model training. Extensive experiments show that our method achieves a 31× speedup compared to state-of-the-art models such as VideoCrafter2, while also attaining the highest overall score on VBench. Moreover, our method supports the generation of videos with up to 26 frames, addressing the limitations of existing U-Net-based methods in long video generation. Notably, the entire training process requires only four GPUs, yet delivers performance competitive with existing leading methods. Hummingbird presents a practical and efficient solution for T2V generation, combining high performance, scalability, and flexibility for real-world applications.
