
FullDiT: Multi-Task Video Generative Foundation Model with Full Attention

March 25, 2025
Authors: Xuan Ju, Weicai Ye, Quande Liu, Qiulin Wang, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Qiang Xu
cs.AI

Abstract

Current video generative foundation models primarily focus on text-to-video tasks, providing limited control for fine-grained video content creation. Although adapter-based approaches (e.g., ControlNet) enable additional controls with minimal fine-tuning, they encounter challenges when integrating multiple conditions, including: branch conflicts between independently trained adapters, parameter redundancy leading to increased computational cost, and suboptimal performance compared to full fine-tuning. To address these challenges, we introduce FullDiT, a unified foundation model for video generation that seamlessly integrates multiple conditions via unified full-attention mechanisms. By fusing multi-task conditions into a unified sequence representation and leveraging the long-context learning ability of full self-attention to capture condition dynamics, FullDiT reduces parameter overhead, avoids condition conflicts, and exhibits scalability and emergent abilities. We further introduce FullBench for multi-task video generation evaluation. Experiments demonstrate that FullDiT achieves state-of-the-art results, highlighting the efficacy of full attention in complex multi-task video generation.
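The core mechanism the abstract describes can be illustrated with a short sketch: condition token streams (e.g., camera trajectories, identities, depth maps) are concatenated with the video tokens into a single sequence, and full self-attention lets every token attend to every other token, so conditions interact directly rather than through separate adapter branches. The code below is a minimal, hypothetical PyTorch sketch of that idea; all class names, shapes, and the block structure are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch (not the authors' code) of fusing multiple condition token
# streams into one unified sequence and applying full self-attention over it.
# All names and shapes are illustrative assumptions.
import torch
import torch.nn as nn

class FullAttentionFusionBlock(nn.Module):
    """One transformer block that attends jointly over video tokens and
    all condition tokens (e.g., camera, identity, depth) in one sequence."""
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, video_tokens, condition_tokens_list):
        # Fuse: concatenate video tokens and every condition stream along
        # the sequence axis -> one unified token sequence.
        x = torch.cat([video_tokens, *condition_tokens_list], dim=1)
        # Full self-attention: every token attends to every other token,
        # so conditions interact directly instead of via adapter branches.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        # Return only the video positions; condition tokens act as context.
        return x[:, : video_tokens.shape[1]]

# Toy usage: batch of 2 videos (64 tokens each) with camera (16 tokens)
# and depth (64 tokens) conditions, all projected to a shared width of 512.
video = torch.randn(2, 64, 512)
camera = torch.randn(2, 16, 512)
depth = torch.randn(2, 64, 512)
out = FullAttentionFusionBlock()(video, [camera, depth])
print(out.shape)  # torch.Size([2, 64, 512])
```

Because all conditions share one attention computation, adding a new condition type only extends the token sequence; under this reading, no new branch or adapter parameters are required, which is consistent with the reduced parameter overhead the abstract claims.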