Mask^2DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation

Abstract

Sora has unveiled the immense potential of the Diffusion Transformer (DiT) architecture in single-scene video generation. However, the more challenging task of multi-scene video generation, which offers broader applications, remains relatively underexplored. To bridge this gap, we propose Mask^2DiT, a novel approach that establishes fine-grained, one-to-one alignment between video segments and their corresponding text annotations. Specifically, we introduce a symmetric binary mask at each attention layer within the DiT architecture, ensuring that each text annotation applies exclusively to its respective video segment while preserving temporal coherence across visual tokens. This attention mechanism enables precise segment-level text-to-visual alignment, allowing the DiT architecture to effectively handle video generation tasks with a fixed number of scenes. To further equip the DiT architecture with the ability to generate additional scenes based on existing ones, we incorporate a segment-level conditional mask, which conditions each newly generated segment on the preceding video segments, thereby enabling auto-regressive scene extension. Both qualitative and quantitative experiments confirm that Mask^2DiT excels in maintaining visual consistency across segments while ensuring semantic alignment between each segment and its corresponding text description. Our project page is https://tianhao-qi.github.io/Mask2DiTProject.
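The two masks described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The token layout (all text annotations first, then all video segments), the function names `build_attention_mask` and `build_condition_mask`, and the choice of `True` for "attention allowed" and "to be generated" are assumptions made here for illustration; the abstract only states that the attention mask is symmetric and binary, that each annotation applies only to its own segment, and that visual tokens stay connected across segments.

```python
from typing import List, Tuple


def build_attention_mask(text_lens: List[int], video_lens: List[int]) -> List[List[bool]]:
    """Return a symmetric boolean mask; mask[q][k] is True where attention is allowed.

    Assumed (hypothetical) token layout: [text_1, ..., text_n, video_1, ..., video_n].
    """
    if len(text_lens) != len(video_lens):
        raise ValueError("need exactly one text annotation per video segment")

    # Label every token with (kind, segment index).
    labels: List[Tuple[str, int]] = []
    for i, n in enumerate(text_lens):
        labels += [("text", i)] * n
    for i, n in enumerate(video_lens):
        labels += [("video", i)] * n

    def allowed(a: Tuple[str, int], b: Tuple[str, int]) -> bool:
        kind_a, seg_a = a
        kind_b, seg_b = b
        if kind_a == "video" and kind_b == "video":
            # Visual tokens see all visual tokens, which keeps segments coherent.
            return True
        # Text-text and text-video attention is allowed only within one segment.
        return seg_a == seg_b

    return [[allowed(q, k) for k in labels] for q in labels]


def build_condition_mask(video_lens: List[int], num_given: int) -> List[bool]:
    """Per-visual-token flags: False for given (conditioning) segments,
    True for segments that still have to be generated. Assumed convention."""
    flags: List[bool] = []
    for i, n in enumerate(video_lens):
        flags += [i >= num_given] * n
    return flags


# Two scenes, 2 text tokens and 3 visual tokens each (toy sizes).
mask = build_attention_mask([2, 2], [3, 3])
cond = build_condition_mask([3, 3], num_given=1)
```

In this toy layout, tokens 0-1 are the first annotation, 2-3 the second, 4-6 the first video segment, and 7-9 the second. The first annotation can attend to tokens 4-6 but not 7-9, while tokens 4-6 and 7-9 attend to each other. For auto-regressive extension, one plausible use of `cond` is to leave the flagged-`False` segments un-noised and apply the denoising objective only to the flagged-`True` tokens; the abstract does not spell out this detail.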
