MATRIX: Mask Track Alignment for Interaction-aware Video Generation
October 8, 2025
Authors: Siyoon Jin, Seongchan Kim, Dahyun Chung, Jaeho Lee, Hyunwook Choi, Jisu Nam, Jiyoung Kim, Seungryong Kim
cs.AI
Abstract
Video DiTs have advanced video generation, yet they still struggle to model
multi-instance or subject-object interactions. This raises a key question: How
do these models internally represent interactions? To answer this, we curate
MATRIX-11K, a video dataset with interaction-aware captions and multi-instance
mask tracks. Using this dataset, we conduct a systematic analysis that
formalizes two perspectives of video DiTs: semantic grounding, via
video-to-text attention, which evaluates whether noun and verb tokens capture
instances and their relations; and semantic propagation, via video-to-video
attention, which assesses whether instance bindings persist across frames. We
find that both effects concentrate in a small subset of interaction-dominant layers.
Motivated by this, we introduce MATRIX, a simple and effective regularization
that aligns attention in specific layers of video DiTs with multi-instance mask
tracks from the MATRIX-11K dataset, enhancing both grounding and propagation.
We further propose InterGenEval, an evaluation protocol for interaction-aware
video generation. In experiments, MATRIX improves both interaction fidelity and
semantic alignment while reducing drift and hallucination. Extensive ablations
validate our design choices. Code and weights will be released.
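
For intuition, the following is a minimal sketch, in PyTorch, of how the semantic-grounding view could be scored: given a layer's video-to-text attention, it measures the fraction of a noun (or verb) token's attention mass that falls inside the corresponding instance's mask track. The function name, tensor shapes, and scoring formula are illustrative assumptions, not the paper's exact metric.

import torch

def grounding_score(attn_v2t: torch.Tensor,   # [F, H, W, T] video-to-text attention
                    token_idx: int,            # index of the noun/verb token
                    mask_track: torch.Tensor,  # [F, H, W] binary instance mask track
                    ) -> torch.Tensor:
    # Fraction of attention directed at `token_idx` that lands inside the mask.
    attn_tok = attn_v2t[..., token_idx]                  # [F, H, W]
    inside = (attn_tok * mask_track).sum()
    total = attn_tok.sum().clamp_min(1e-8)
    return inside / total

# Toy usage: 8 frames, a 16x16 latent grid, 77 text tokens.
attn = torch.rand(8, 16, 16, 77)
mask = (torch.rand(8, 16, 16) > 0.5).float()
print(grounding_score(attn, token_idx=5, mask_track=mask))

A score near 1 would indicate that the token's attention is well grounded in the instance; comparing such scores across layers is one way an analysis like this could surface interaction-dominant layers.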
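
In the same spirit, here is a minimal sketch of an attention-to-mask alignment regularizer like the one MATRIX applies at interaction-dominant layers. The abstract states only that attention in specific layers is aligned with multi-instance mask tracks; the cross-entropy loss form, names, and shapes below are assumptions made for illustration.

import torch

def matrix_alignment_loss(attn_v2t: torch.Tensor,   # [N, T], N = F*H*W video tokens, T text tokens
                          noun_ids: list,            # one text-token index per instance
                          mask_tracks: torch.Tensor, # [K, N] flattened binary mask track per instance
                          eps: float = 1e-8) -> torch.Tensor:
    # Pull each noun token's attention distribution over video tokens toward a
    # uniform distribution over that instance's mask track (cross-entropy H(q, p)).
    loss = attn_v2t.new_zeros(())
    for k, t in enumerate(noun_ids):
        p = attn_v2t[:, t] / attn_v2t[:, t].sum().clamp_min(eps)      # attention as a distribution
        q = mask_tracks[k] / mask_tracks[k].sum().clamp_min(eps)      # target supported on the mask
        loss = loss - (q * p.clamp_min(eps).log()).sum()
    return loss / max(len(noun_ids), 1)

Added to the diffusion training objective with a small weight, a term of this form would penalize attention mass that leaks outside an instance's mask track; this is one plausible way to encourage grounding (video-to-text), and an analogous video-to-video term could encourage propagation across frames.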