FullDiT: Multi-Task Video Generative Foundation Model with Full Attention

March 25, 2025
Authors: Xuan Ju, Weicai Ye, Quande Liu, Qiulin Wang, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Qiang Xu
cs.AI

Abstract

Current video generative foundation models primarily focus on text-to-video tasks, providing limited control for fine-grained video content creation. Although adapter-based approaches (e.g., ControlNet) enable additional control with minimal fine-tuning, they encounter challenges when integrating multiple conditions, including branch conflicts between independently trained adapters, parameter redundancy leading to increased computational cost, and suboptimal performance compared to full fine-tuning. To address these challenges, we introduce FullDiT, a unified foundation model for video generation that seamlessly integrates multiple conditions via a unified full-attention mechanism. By fusing multi-task conditions into a unified sequence representation and leveraging the long-context learning ability of full self-attention to capture condition dynamics, FullDiT reduces parameter overhead, avoids condition conflicts, and exhibits scalability and emergent abilities. We further introduce FullBench for multi-task video generation evaluation. Experiments demonstrate that FullDiT achieves state-of-the-art results, highlighting the efficacy of full attention in complex multi-task video generation.
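To make the unified-sequence idea concrete, here is a minimal PyTorch sketch: per-task condition tokens are concatenated with the noisy video latent tokens into a single sequence, and one full self-attention block models every cross-token interaction, so no per-condition adapter branches are needed. The class name, dimensions, and the particular conditions shown (text, camera, depth) are illustrative assumptions, not FullDiT's actual implementation.

```python
# Illustrative sketch of fusing multi-task conditions into one sequence
# for full self-attention. All names and sizes here are assumptions.
import torch
import torch.nn as nn

class FullAttentionFusionBlock(nn.Module):
    """One pre-norm transformer block in which video latents and all
    condition tokens attend to each other in a single attention pass."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

# Hypothetical token streams, each already projected to the shared model
# dimension: noisy video latents plus per-task condition tokens.
B, dim = 2, 512
video_tokens  = torch.randn(B, 256, dim)  # noisy video latent tokens
text_tokens   = torch.randn(B, 77, dim)   # text condition
camera_tokens = torch.randn(B, 32, dim)   # camera-trajectory condition
depth_tokens  = torch.randn(B, 64, dim)   # depth condition

# Fuse all conditions into one long sequence; full self-attention then
# captures interactions between every pair of tokens.
seq = torch.cat([text_tokens, camera_tokens, depth_tokens, video_tokens], dim=1)
out = FullAttentionFusionBlock(dim)(seq)
print(out.shape)  # torch.Size([2, 429, 512])
```

Under this design, supporting an additional condition amounts to appending its tokens to the fused sequence rather than attaching a new adapter branch, which is the property the abstract credits for avoiding branch conflicts and parameter redundancy.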

