MATRIX: Mask Track Alignment for Interaction-aware Video Generation

Abstract

Video DiTs have advanced video generation, yet they still struggle to model multi-instance or subject-object interactions. This raises a key question: how do these models internally represent interactions? To answer this, we curate MATRIX-11K, a video dataset with interaction-aware captions and multi-instance mask tracks. Using this dataset, we conduct a systematic analysis that formalizes two perspectives of video DiTs: semantic grounding, via video-to-text attention, which evaluates whether noun and verb tokens capture instances and their relations; and semantic propagation, via video-to-video attention, which assesses whether instance bindings persist across frames. We find that both effects concentrate in a small subset of interaction-dominant layers. Motivated by this, we introduce MATRIX, a simple and effective regularization that aligns attention in specific layers of video DiTs with multi-instance mask tracks from the MATRIX-11K dataset, enhancing both grounding and propagation. We further propose InterGenEval, an evaluation protocol for interaction-aware video generation. In experiments, MATRIX improves both interaction fidelity and semantic alignment while reducing drift and hallucination. Extensive ablations validate our design choices. Code and weights will be released.
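The abstract does not give the form of the alignment regularizer, so the following is only an illustrative sketch of what "aligning attention with a mask track" could look like. It assumes a loss equal to one minus the share of a text token's attention mass that falls inside the instance mask, summed over frames. The function name `attention_mask_alignment_loss`, its arguments, and this particular loss are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def attention_mask_alignment_loss(
    attention: Sequence[Sequence[float]],
    mask_track: Sequence[Sequence[int]],
    eps: float = 1e-8,
) -> float:
    """Hypothetical alignment loss (an assumption, not the paper's loss).

    attention[f][p]  : non-negative attention from one text token
                       (e.g. a noun) to spatial position p in frame f.
    mask_track[f][p] : 1 if position p in frame f belongs to the
                       instance named by that token, else 0.

    Returns 1 - (attention mass inside the mask / total attention mass),
    so 0 means all attention lies on the instance in every frame.
    """
    if len(attention) != len(mask_track):
        raise ValueError("attention and mask_track must cover the same frames")

    inside = 0.0  # attention mass that lands on masked positions
    total = 0.0   # all attention mass across frames and positions
    for attn_frame, mask_frame in zip(attention, mask_track):
        if len(attn_frame) != len(mask_frame):
            raise ValueError("each frame needs one mask value per position")
        for weight, in_mask in zip(attn_frame, mask_frame):
            total += weight
            inside += weight * in_mask

    # eps guards against division by zero when attention is all zeros.
    return 1.0 - inside / (total + eps)


if __name__ == "__main__":
    # Two frames, two spatial positions each; the instance occupies position 0.
    attn = [[0.5, 0.5], [1.0, 0.0]]
    mask = [[1, 0], [1, 0]]
    print(attention_mask_alignment_loss(attn, mask))
```

In this toy case three quarters of the attention mass lies on the instance, so the loss is close to 0.25. Because the mask track spans all frames, a term of this kind penalizes both poor grounding within a frame and drift of the binding across frames.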
