FullDiT: Multi-Task Video Generative Foundation Model with Full Attention

Abstract

Current video generative foundation models focus primarily on text-to-video tasks and offer limited control over fine-grained video content creation. Adapter-based approaches (e.g., ControlNet) enable additional controls with minimal fine-tuning, but they run into three problems when integrating multiple conditions: branch conflicts between independently trained adapters, parameter redundancy that increases computational cost, and suboptimal performance compared to full fine-tuning. To address these challenges, we introduce FullDiT, a unified foundation model for video generation that seamlessly integrates multiple conditions via a unified full-attention mechanism. By fusing multi-task conditions into a unified sequence representation and leveraging the long-context learning ability of full self-attention to capture condition dynamics, FullDiT reduces parameter overhead, avoids condition conflicts, and exhibits scalability and emergent abilities. We further introduce FullBench, a benchmark for evaluating multi-task video generation. Experiments demonstrate that FullDiT achieves state-of-the-art results, highlighting the efficacy of full attention in complex multi-task video generation.
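The core idea of fusing conditions into one sequence and attending over all of it can be illustrated with a toy sketch. The following is not the authors' implementation: it assumes a single attention head with identity query/key/value projections, random vectors in place of learned token embeddings, and no positional encoding or diffusion process. The condition names (`camera`, `identity`, `depth`) and the helper names `fuse_conditions` and `full_self_attention` are hypothetical, chosen only for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_conditions(condition_tokens, video_tokens):
    """Concatenate every condition's tokens and the video tokens into one
    sequence, recording the (start, end) span each group occupies."""
    sequence, spans = [], {}
    for name, tokens in condition_tokens.items():
        spans[name] = (len(sequence), len(sequence) + len(tokens))
        sequence.extend(tokens)
    spans["video"] = (len(sequence), len(sequence) + len(video_tokens))
    sequence.extend(video_tokens)
    return sequence, spans


def full_self_attention(tokens):
    """Unmasked single-head self-attention with identity projections
    (an assumption of this sketch): every token attends to every token."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for query in tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        row = softmax(scores)
        weights.append(row)
        outputs.append(
            [sum(row[j] * tokens[j][d] for j in range(len(tokens))) for d in range(dim)]
        )
    return outputs, weights


# Hypothetical inputs: random vectors stand in for learned embeddings.
rng = random.Random(0)


def random_tokens(count, dim=8):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


conditions = {
    "camera": random_tokens(2),
    "identity": random_tokens(3),
    "depth": random_tokens(4),
}
video = random_tokens(6)

sequence, spans = fuse_conditions(conditions, video)
attended, weights = full_self_attention(sequence)

# Only the video span would be passed on as the generated content.
start, end = spans["video"]
video_out = attended[start:end]
```

Because the attention is unmasked, each video token receives a nonzero weight from every condition token, and adding another condition in this sketch only lengthens the sequence rather than adding a separate adapter branch. That is the property the abstract contrasts with independently trained adapters.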