MATRIX: Mask Track Alignment for Interaction-aware Video Generation
October 8, 2025
Authors: Siyoon Jin, Seongchan Kim, Dahyun Chung, Jaeho Lee, Hyunwook Choi, Jisu Nam, Jiyoung Kim, Seungryong Kim
cs.AI
Abstract
Video Diffusion Transformers (video DiTs) have advanced video generation, yet they still struggle to model
multi-instance or subject-object interactions. This raises a key question: How
do these models internally represent interactions? To answer this, we curate
MATRIX-11K, a video dataset with interaction-aware captions and multi-instance
mask tracks. Using this dataset, we conduct a systematic analysis that
formalizes two perspectives of video DiTs: semantic grounding, via
video-to-text attention, which evaluates whether noun and verb tokens capture
instances and their relations; and semantic propagation, via video-to-video
attention, which assesses whether instance bindings persist across frames. We
find both effects concentrate in a small subset of interaction-dominant layers.
Motivated by this, we introduce MATRIX, a simple and effective regularization
that aligns attention in specific layers of video DiTs with multi-instance mask
tracks from the MATRIX-11K dataset, enhancing both grounding and propagation.
We further propose InterGenEval, an evaluation protocol for interaction-aware
video generation. In experiments, MATRIX improves both interaction fidelity and
semantic alignment while reducing drift and hallucination. Extensive ablations
validate our design choices. Code and weights will be released.
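
To make the attention-to-mask alignment idea concrete, below is a minimal PyTorch-style sketch of a regularizer that pushes video-to-text attention for each instance token toward that instance's mask track. The tensor layout, the KL-divergence form of the loss, and the name mask_track_alignment_loss are illustrative assumptions, not the paper's actual MATRIX implementation.

# Minimal sketch (assumed, not the official MATRIX code): align a video-to-text
# attention map with per-instance mask tracks via a KL-style regularizer.
import torch

def mask_track_alignment_loss(attn, masks, eps=1e-8):
    """Encourage attention to concentrate on each instance's mask track.

    attn:  (B, T, H, W, N) attention from video tokens to N instance text tokens.
    masks: (B, T, H, W, N) binary multi-instance mask tracks (1 inside the instance).
    """
    # Target distribution: uniform over each instance's mask region, per frame.
    target = masks / masks.sum(dim=(2, 3), keepdim=True).clamp_min(eps)
    # Normalize attention over spatial locations so it is also a distribution.
    attn = attn / attn.sum(dim=(2, 3), keepdim=True).clamp_min(eps)
    # KL(target || attn), summed over space, averaged over batch, frames, instances.
    kl = (target * (target.clamp_min(eps).log() - attn.clamp_min(eps).log())).sum(dim=(2, 3))
    return kl.mean()

if __name__ == "__main__":
    B, T, H, W, N = 1, 4, 16, 16, 2  # toy sizes for illustration only
    attn = torch.rand(B, T, H, W, N)
    masks = (torch.rand(B, T, H, W, N) > 0.7).float()
    masks[..., 0, 0, :] = 1.0  # ensure no instance mask is empty in this toy example
    print(float(mask_track_alignment_loss(attn, masks)))

In practice such a loss would be applied only at the interaction-dominant layers identified by the analysis, added to the diffusion training objective with a small weight; those choices are not specified in the abstract and are left open here.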