Mask^2DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation

Abstract

Sora has unveiled the immense potential of the Diffusion Transformer (DiT) architecture in single-scene video generation. However, the more challenging task of multi-scene video generation, which offers broader applications, remains relatively underexplored. To bridge this gap, we propose Mask^2DiT, a novel approach that establishes fine-grained, one-to-one alignment between video segments and their corresponding text annotations. Specifically, we introduce a symmetric binary mask at each attention layer within the DiT architecture, ensuring that each text annotation applies exclusively to its respective video segment while preserving temporal coherence across visual tokens. This attention mechanism enables precise segment-level textual-to-visual alignment, allowing the DiT architecture to effectively handle video generation tasks with a fixed number of scenes. To further equip the DiT architecture with the ability to generate additional scenes based on existing ones, we incorporate a segment-level conditional mask, which conditions each newly generated segment on the preceding video segments, thereby enabling auto-regressive scene extension. Both qualitative and quantitative experiments confirm that Mask^2DiT excels in maintaining visual consistency across segments while ensuring semantic alignment between each segment and its corresponding text description. Our project page is https://tianhao-qi.github.io/Mask2DiTProject.
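
The two masks described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a token layout of all text tokens followed by all visual tokens, and it assumes that text tokens of different segments do not attend to each other; the abstract specifies neither. The function names `build_segment_attention_mask` and `build_conditional_mask` are hypothetical.

```python
import numpy as np

def build_segment_attention_mask(text_lens, visual_lens):
    """Symmetric binary attention mask (True = attention allowed).

    Assumed layout: [text_0, ..., text_{n-1}, visual_0, ..., visual_{n-1}].
    """
    assert len(text_lens) == len(visual_lens), "one annotation per segment"
    n_segments = len(text_lens)
    # Segment index of every token, text tokens first, then visual tokens.
    text_seg = np.repeat(np.arange(n_segments), text_lens)
    vis_seg = np.repeat(np.arange(n_segments), visual_lens)
    seg = np.concatenate([text_seg, vis_seg])
    is_visual = np.arange(seg.size) >= text_seg.size
    # Tokens of the same segment (text or visual) may attend to each other.
    same_segment = seg[:, None] == seg[None, :]
    # All visual tokens attend to each other, which keeps temporal coherence.
    both_visual = is_visual[:, None] & is_visual[None, :]
    return same_segment | both_visual

def build_conditional_mask(visual_lens, n_context_segments):
    """Per-visual-token flag (True = token is to be generated).

    Assumption: the first n_context_segments segments are given as context
    and the remaining segments are generated conditioned on them.
    """
    seg = np.repeat(np.arange(len(visual_lens)), visual_lens)
    return seg >= n_context_segments

# Two scenes, each with 2 text tokens and 3 visual tokens.
mask = build_segment_attention_mask([2, 2], [3, 3])
to_generate = build_conditional_mask([3, 3], n_context_segments=1)
```

In this layout, a text token of scene 0 can attend to the visual tokens of scene 0 but not to those of scene 1, while visual tokens of both scenes remain mutually visible. A boolean mask of this shape could be passed as the attention mask of a standard attention layer.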