MATRIX: Mask Track Alignment for Interaction-aware Video Generation
October 8, 2025
Authors: Siyoon Jin, Seongchan Kim, Dahyun Chung, Jaeho Lee, Hyunwook Choi, Jisu Nam, Jiyoung Kim, Seungryong Kim
cs.AI
Abstract
Video Diffusion Transformers (video DiTs) have advanced video generation, yet they still struggle to model
multi-instance or subject-object interactions. This raises a key question: How
do these models internally represent interactions? To answer this, we curate
MATRIX-11K, a video dataset with interaction-aware captions and multi-instance
mask tracks. Using this dataset, we conduct a systematic analysis that
formalizes two perspectives of video DiTs: semantic grounding, via
video-to-text attention, which evaluates whether noun and verb tokens capture
instances and their relations; and semantic propagation, via video-to-video
attention, which assesses whether instance bindings persist across frames. We
find both effects concentrate in a small subset of interaction-dominant layers.
Motivated by this, we introduce MATRIX, a simple and effective regularization
that aligns attention in specific layers of video DiTs with multi-instance mask
tracks from the MATRIX-11K dataset, enhancing both grounding and propagation.
We further propose InterGenEval, an evaluation protocol for interaction-aware
video generation. In experiments, MATRIX improves both interaction fidelity and
semantic alignment while reducing drift and hallucination. Extensive ablations
validate our design choices. Code and weights will be released.
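
To make the attention-to-mask alignment idea concrete, below is a minimal PyTorch-style sketch of a regularizer that pushes video-to-text attention for each instance token toward that instance's mask track. The tensor layout, the KL-divergence form of the loss, and the name mask_track_alignment_loss are illustrative assumptions, not the paper's actual MATRIX implementation.

# Minimal sketch (assumed, not the official MATRIX code): align a video-to-text
# attention map with per-instance mask tracks via a KL-style regularizer.
import torch

def mask_track_alignment_loss(attn, masks, eps=1e-8):
    """Encourage attention to concentrate on each instance's mask track.

    attn:  (B, T, H, W, N) attention from video tokens to N instance text tokens.
    masks: (B, T, H, W, N) binary multi-instance mask tracks (1 inside the instance).
    """
    # Target distribution: uniform over each instance's mask region, per frame.
    target = masks / masks.sum(dim=(2, 3), keepdim=True).clamp_min(eps)
    # Normalize attention over spatial locations so it is also a distribution.
    attn = attn / attn.sum(dim=(2, 3), keepdim=True).clamp_min(eps)
    # KL(target || attn), summed over space, averaged over batch, frames, instances.
    kl = (target * (target.clamp_min(eps).log() - attn.clamp_min(eps).log())).sum(dim=(2, 3))
    return kl.mean()

if __name__ == "__main__":
    B, T, H, W, N = 1, 4, 16, 16, 2  # toy sizes for illustration only
    attn = torch.rand(B, T, H, W, N)
    masks = (torch.rand(B, T, H, W, N) > 0.7).float()
    masks[..., 0, 0, :] = 1.0  # ensure no instance mask is empty in this toy example
    print(float(mask_track_alignment_loss(attn, masks)))

In practice such a loss would be applied only at the interaction-dominant layers identified by the analysis, added to the diffusion training objective with a small weight; those choices are not specified in the abstract and are left open here.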