YoCausal: From a Causal Perspective, How Far Is Video Generation from World Models?

Abstract

As video diffusion models (VDMs) advance toward world models, a key question arises: do they truly understand causality, or do they merely overfit to statistical temporal patterns? Existing benchmarks rely mostly on synthetic data, and the resulting sim-to-real gap limits how well their conclusions generalize to the real world. We present YoCausal, a two-level benchmark inspired by the Violation of Expectation (VoE) paradigm from cognitive science. Temporally reversing real-world videos yields natural counterfactual samples at zero cost, which gives YoCausal an arbitrarily scalable evaluation protocol. Level 1 introduces the Reverse Surprise Index (RSI), which quantifies arrow-of-time perception via the denoising loss. Level 2 introduces the Causality Cognition Index (CCI), which uses a vision-language model (VLM) to stratify the dataset into causal and non-causal subsets, disentangling genuine causal reasoning from temporal bias. An evaluation of 13 state-of-the-art VDMs shows that perceiving the arrow of time does not imply understanding causality, and that a significant gap remains relative to human-level causal cognition.
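The protocol described above can be illustrated with a minimal sketch. The abstract does not give the formulas for RSI or CCI, so everything below is an assumption made for illustration: `toy_loss` stands in for a real VDM's denoising loss, `surprise_gap` is a hypothetical normalized difference rather than the paper's RSI, and the hard-coded subset labels stand in for the VLM's causal / non-causal stratification.

```python
import numpy as np


def reverse_time(video: np.ndarray) -> np.ndarray:
    """Return the clip played backwards; axis 0 is assumed to be time."""
    return video[::-1].copy()


def toy_loss(video: np.ndarray, drift: float = 0.1) -> float:
    """Stand-in for a VDM denoising loss (NOT a real diffusion model).

    The toy 'model' expects each frame to be `drift` brighter than the
    previous one and is scored by mean squared prediction error.
    """
    predicted_next = video[:-1] + drift
    return float(np.mean((video[1:] - predicted_next) ** 2))


def surprise_gap(video: np.ndarray, loss_fn=toy_loss, eps: float = 1e-12) -> float:
    """Hypothetical score in [-1, 1]; positive means the reversed clip
    is more surprising to the model than the original clip."""
    forward = loss_fn(video)
    backward = loss_fn(reverse_time(video))
    return (backward - forward) / (backward + forward + eps)


# Two tiny synthetic clips of shape (T, H, W): one directional, one static.
frames = np.arange(8, dtype=np.float64).reshape(8, 1, 1)
brightening = 0.1 * frames * np.ones((8, 4, 4))
static = np.full((8, 4, 4), 0.5)

# Placeholder labels: in the benchmark a VLM assigns them. These toy
# clips carry no real causal content; they only show the bookkeeping.
clips = {"causal": [brightening], "non_causal": [static]}
subset_means = {
    name: float(np.mean([surprise_gap(v) for v in vids]))
    for name, vids in clips.items()
}
print(subset_means)
```

In this toy setup the directional clip scores close to 1 and the static clip scores 0, because reversing a static clip changes nothing. Reporting the score per subset, rather than pooled, is what lets a benchmark separate sensitivity to temporal direction from sensitivity to causal structure.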