Mask^2DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation

Abstract

Sora has unveiled the immense potential of the Diffusion Transformer (DiT) architecture in single-scene video generation. However, the more challenging task of multi-scene video generation, which offers broader applications, remains relatively underexplored. To bridge this gap, we propose Mask^2DiT, a novel approach that establishes fine-grained, one-to-one alignment between video segments and their corresponding text annotations. Specifically, we introduce a symmetric binary mask at each attention layer within the DiT architecture, ensuring that each text annotation applies exclusively to its respective video segment while preserving temporal coherence across visual tokens. This attention mechanism enables precise segment-level text-to-visual alignment, allowing the DiT architecture to effectively handle video generation tasks with a fixed number of scenes. To further equip the DiT architecture with the ability to generate additional scenes based on existing ones, we incorporate a segment-level conditional mask, which conditions each newly generated segment on the preceding video segments, thereby enabling auto-regressive scene extension. Both qualitative and quantitative experiments confirm that Mask^2DiT maintains visual consistency across segments while ensuring semantic alignment between each segment and its corresponding text description. Our project page is https://tianhao-qi.github.io/Mask2DiTProject.
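To make the two masks concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the token layout (all text tokens first, then all visual tokens), the function names build_segment_attention_mask and condition_flags, and the exact attention rules are hypothetical choices that are merely consistent with the description above. The abstract does not specify how the conditional mask is consumed by the denoiser, so the second function only builds the per-segment flags.

```python
from typing import List, Tuple


def build_segment_attention_mask(text_lens: List[int], vis_lens: List[int]) -> List[List[int]]:
    """Return a symmetric 0/1 mask over the sequence [text_1..text_n, vis_1..vis_n].

    mask[i][j] == 1 means token i may attend to token j.
    Layout and rules are assumptions made for this sketch.
    """
    if len(text_lens) != len(vis_lens):
        raise ValueError("need exactly one text annotation per video segment")

    # Label every token with (kind, segment index).
    labels: List[Tuple[str, int]] = []
    for seg, n in enumerate(text_lens):
        labels += [("text", seg)] * n
    for seg, n in enumerate(vis_lens):
        labels += [("vis", seg)] * n

    def allowed(a: Tuple[str, int], b: Tuple[str, int]) -> bool:
        (kind_a, seg_a), (kind_b, seg_b) = a, b
        if kind_a == "vis" and kind_b == "vis":
            # Visual tokens see all visual tokens, across segments.
            return True
        # Any pair involving a text token must stay inside one segment.
        return seg_a == seg_b

    return [[int(allowed(a, b)) for b in labels] for a in labels]


def condition_flags(num_existing: int, num_new: int) -> List[int]:
    """Segment-level conditional mask: 1 = given as context, 0 = to be generated."""
    return [1] * num_existing + [0] * num_new


# Two scenes, each with 2 text tokens and 3 visual tokens.
mask = build_segment_attention_mask(text_lens=[2, 2], vis_lens=[3, 3])
for row in mask:
    print("".join(map(str, row)))

# Extend two existing scenes with one newly generated scene.
print(condition_flags(num_existing=2, num_new=1))
```

In this sketch the visual-to-visual block is fully open, which is one way to preserve temporal coherence across segments, while each text block is visible only to itself and to its own segment's visual tokens. Because the rule depends only on the unordered pair of tokens, the resulting mask is symmetric by construction.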