YoCausal: How Far Is Video Generation from World Models? A Causality Perspective

Abstract

As video diffusion models (VDMs) evolve into world models, a key question arises: do they genuinely understand causality, or do they merely overfit to statistical temporal patterns? Existing benchmarks rely mainly on synthetic data, so the sim-to-real gap limits how well their conclusions generalize to the real world. We present YoCausal, a two-level benchmark inspired by the Violation of Expectation (VoE) paradigm from cognitive science. By temporally reversing real-world videos, which costs nothing, and using the reversed clips as natural counterfactual samples, YoCausal establishes an arbitrarily scalable evaluation protocol. Level 1 introduces the Reverse Surprise Index (RSI), which quantifies arrow-of-time perception through the denoising loss. Level 2 introduces the Causality Cognition Index (CCI), which uses a vision-language model (VLM) to stratify the dataset into causal and non-causal subsets, thereby disentangling genuine causal reasoning from temporal bias. An evaluation of 13 state-of-the-art VDMs shows that perceiving the arrow of time does not imply understanding causality, and that a substantial gap to human-level causal cognition remains.
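
The abstract does not give the formulas for RSI or CCI, so the following is only an illustrative sketch of the general idea: compare a model's denoising loss on forward and time-reversed clips, then compare that behavior across causal and non-causal subsets. The function names (`reverse_clip`, `reverse_surprise_rate`, `causal_gap`) and the scoring rules are assumptions made for illustration and are not the paper's definitions. The loss values are assumed to have been computed beforehand by some VDM, and the causal labels by some VLM.

```python
from statistics import mean


def reverse_clip(frames):
    """Return the frames of a clip in reverse temporal order."""
    return list(reversed(frames))


def reverse_surprise_rate(forward_losses, reversed_losses):
    """Fraction of clips whose reversed version has the higher denoising loss.

    Hypothetical stand-in for an arrow-of-time score; NOT the paper's RSI.
    """
    if not forward_losses or len(forward_losses) != len(reversed_losses):
        raise ValueError("need two non-empty loss lists of equal length")
    return mean(
        1.0 if rev > fwd else 0.0
        for fwd, rev in zip(forward_losses, reversed_losses)
    )


def causal_gap(forward_losses, reversed_losses, is_causal):
    """Difference in surprise rate between causal and non-causal clips.

    Hypothetical stand-in for a stratified comparison; NOT the paper's CCI.
    `is_causal` holds one boolean per clip (assumed to come from a VLM).
    """
    if len(is_causal) != len(forward_losses):
        raise ValueError("need one causal label per clip")
    triples = list(zip(forward_losses, reversed_losses, is_causal))
    causal = [(f, r) for f, r, c in triples if c]
    non_causal = [(f, r) for f, r, c in triples if not c]
    if not causal or not non_causal:
        raise ValueError("both subsets must be non-empty")
    causal_rate = reverse_surprise_rate(*zip(*causal))
    non_causal_rate = reverse_surprise_rate(*zip(*non_causal))
    return causal_rate - non_causal_rate
```

Under these assumptions, a surprise rate near 0.5 would mean the model is indifferent to playback direction, while a positive gap would mean reversal is penalized more often on causal clips than on non-causal ones.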