MATRIX: Mask Track Alignment for Interaction-aware Video Generation
October 8, 2025
Authors: Siyoon Jin, Seongchan Kim, Dahyun Chung, Jaeho Lee, Hyunwook Choi, Jisu Nam, Jiyoung Kim, Seungryong Kim
cs.AI
Abstract
Video DiTs have advanced video generation, yet they still struggle to model
multi-instance or subject-object interactions. This raises a key question: How
do these models internally represent interactions? To answer this, we curate
MATRIX-11K, a video dataset with interaction-aware captions and multi-instance
mask tracks. Using this dataset, we conduct a systematic analysis that
formalizes two perspectives of video DiTs: semantic grounding, via
video-to-text attention, which evaluates whether noun and verb tokens capture
instances and their relations; and semantic propagation, via video-to-video
attention, which assesses whether instance bindings persist across frames. We
find that both effects concentrate in a small subset of interaction-dominant layers.
Motivated by this, we introduce MATRIX, a simple and effective regularization
that aligns attention in specific layers of video DiTs with multi-instance mask
tracks from the MATRIX-11K dataset, enhancing both grounding and propagation.
We further propose InterGenEval, an evaluation protocol for interaction-aware
video generation. In experiments, MATRIX improves both interaction fidelity and
semantic alignment while reducing drift and hallucination. Extensive ablations
validate our design choices. Code and weights will be released.
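
For intuition, the following is a minimal sketch, in PyTorch, of how the semantic-grounding view could be scored: given a layer's video-to-text attention, it measures the fraction of a noun (or verb) token's attention mass that falls inside the corresponding instance's mask track. The function name, tensor shapes, and scoring formula are illustrative assumptions, not the paper's exact metric.

import torch

def grounding_score(attn_v2t: torch.Tensor,   # [F, H, W, T] video-to-text attention
                    token_idx: int,            # index of the noun/verb token
                    mask_track: torch.Tensor,  # [F, H, W] binary instance mask track
                    ) -> torch.Tensor:
    # Fraction of attention directed at `token_idx` that lands inside the mask.
    attn_tok = attn_v2t[..., token_idx]                  # [F, H, W]
    inside = (attn_tok * mask_track).sum()
    total = attn_tok.sum().clamp_min(1e-8)
    return inside / total

# Toy usage: 8 frames, a 16x16 latent grid, 77 text tokens.
attn = torch.rand(8, 16, 16, 77)
mask = (torch.rand(8, 16, 16) > 0.5).float()
print(grounding_score(attn, token_idx=5, mask_track=mask))

A score near 1 would indicate that the token's attention is well grounded in the instance; comparing such scores across layers is one way an analysis like this could surface interaction-dominant layers.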
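
In the same spirit, here is a minimal sketch of an attention-to-mask alignment regularizer like the one MATRIX applies at interaction-dominant layers. The abstract states only that attention in specific layers is aligned with multi-instance mask tracks; the cross-entropy loss form, names, and shapes below are assumptions made for illustration.

import torch

def matrix_alignment_loss(attn_v2t: torch.Tensor,   # [N, T], N = F*H*W video tokens, T text tokens
                          noun_ids: list,            # one text-token index per instance
                          mask_tracks: torch.Tensor, # [K, N] flattened binary mask track per instance
                          eps: float = 1e-8) -> torch.Tensor:
    # Pull each noun token's attention distribution over video tokens toward a
    # uniform distribution over that instance's mask track (cross-entropy H(q, p)).
    loss = attn_v2t.new_zeros(())
    for k, t in enumerate(noun_ids):
        p = attn_v2t[:, t] / attn_v2t[:, t].sum().clamp_min(eps)      # attention as a distribution
        q = mask_tracks[k] / mask_tracks[k].sum().clamp_min(eps)      # target supported on the mask
        loss = loss - (q * p.clamp_min(eps).log()).sum()
    return loss / max(len(noun_ids), 1)

Added to the diffusion training objective with a small weight, a term of this form would penalize attention mass that leaks outside an instance's mask track; this is one plausible way to encourage grounding (video-to-text), and an analogous video-to-video term could encourage propagation across frames.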