
FullDiT: Multi-Task Video Generative Foundation Model with Full Attention

March 25, 2025
Authors: Xuan Ju, Weicai Ye, Quande Liu, Qiulin Wang, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Qiang Xu
cs.AI

Abstract

Current video generative foundation models primarily focus on text-to-video tasks, providing limited control for fine-grained video content creation. Although adapter-based approaches (e.g., ControlNet) enable additional controls with minimal fine-tuning, they encounter challenges when integrating multiple conditions, including: branch conflicts between independently trained adapters, parameter redundancy leading to increased computational cost, and suboptimal performance compared to full fine-tuning. To address these challenges, we introduce FullDiT, a unified foundation model for video generation that seamlessly integrates multiple conditions via unified full-attention mechanisms. By fusing multi-task conditions into a unified sequence representation and leveraging the long-context learning ability of full self-attention to capture condition dynamics, FullDiT reduces parameter overhead, avoids condition conflicts, and exhibits scalability and emergent abilities. We further introduce FullBench for multi-task video generation evaluation. Experiments demonstrate that FullDiT achieves state-of-the-art results, highlighting the efficacy of full attention in complex multi-task video generation.
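The core mechanism the abstract describes can be illustrated with a short sketch: condition token streams (e.g., camera trajectories, identities, depth maps) are concatenated with the video tokens into a single sequence, and full self-attention lets every token attend to every other token, so conditions interact directly rather than through separate adapter branches. The code below is a minimal, hypothetical PyTorch sketch of that idea; all class names, shapes, and the block structure are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch (not the authors' code) of fusing multiple condition token
# streams into one unified sequence and applying full self-attention over it.
# All names and shapes are illustrative assumptions.
import torch
import torch.nn as nn

class FullAttentionFusionBlock(nn.Module):
    """One transformer block that attends jointly over video tokens and
    all condition tokens (e.g., camera, identity, depth) in one sequence."""
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, video_tokens, condition_tokens_list):
        # Fuse: concatenate video tokens and every condition stream along
        # the sequence axis -> one unified token sequence.
        x = torch.cat([video_tokens, *condition_tokens_list], dim=1)
        # Full self-attention: every token attends to every other token,
        # so conditions interact directly instead of via adapter branches.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        # Return only the video positions; condition tokens act as context.
        return x[:, : video_tokens.shape[1]]

# Toy usage: batch of 2 videos (64 tokens each) with camera (16 tokens)
# and depth (64 tokens) conditions, all projected to a shared width of 512.
video = torch.randn(2, 64, 512)
camera = torch.randn(2, 16, 512)
depth = torch.randn(2, 64, 512)
out = FullAttentionFusionBlock()(video, [camera, depth])
print(out.shape)  # torch.Size([2, 64, 512])
```

Because all conditions share one attention computation, adding a new condition type only extends the token sequence; under this reading, no new branch or adapter parameters are required, which is consistent with the reduced parameter overhead the abstract claims.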