YoCausal: How Far is Video Generation from World Model? A Causality Perspective

May 28, 2026
Authors: You-Zhe Xie, Yu-Hsuan Li, Jie-Ying Lee, Kaipeng Zhang, Yu-Lun Liu, Zhixiang Wang
cs.AI

Abstract

As video diffusion models (VDMs) advance toward world models, a key question arises: do they truly understand causality, or merely overfit to statistical temporal patterns? Existing benchmarks mostly rely on synthetic data, limiting real-world generalization due to the sim-to-real gap. We present YoCausal, a two-level benchmark inspired by the Violation of Expectation (VoE) paradigm from cognitive science. By temporally reversing real-world videos at zero cost as natural counterfactual samples, YoCausal establishes an arbitrarily extensible evaluation protocol. Level 1 introduces the Reverse Surprise Index (RSI), quantifying arrow-of-time perception via denoising loss. Level 2 introduces the Causality Cognition Index (CCI), which leverages a VLM to stratify datasets into causal and non-causal subsets, disentangling genuine causal reasoning from temporal bias. Evaluation of 13 state-of-the-art VDMs reveals that perceiving the arrow of time does not imply understanding causality, and a significant gap persists relative to human-level causal cognition.
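
To make the Level 1 idea concrete, the following is a minimal sketch of comparing a model's denoising loss on a clip against the same clip played backward. The abstract does not give the RSI formula, so the normalized gap below is an assumption for illustration, not the paper's definition. The names `reverse_time`, `denoising_loss`, and `surprise_gap`, the `denoiser(noisy, noise_level)` callable, and the simple variance-preserving noising step are all hypothetical.

```python
import numpy as np


def reverse_time(video):
    """Flip a clip along its time axis; video has shape (T, H, W, C)."""
    return video[::-1].copy()


def denoising_loss(denoiser, video, noise_level, rng, n_samples=8):
    """Monte Carlo estimate of the noise-prediction MSE for one clip.

    Assumption: the model predicts the added noise, and the clip is noised as
    sqrt(1 - noise_level) * video + sqrt(noise_level) * noise.
    """
    losses = []
    for _ in range(n_samples):
        noise = rng.standard_normal(video.shape)
        noisy = np.sqrt(1.0 - noise_level) * video + np.sqrt(noise_level) * noise
        predicted = denoiser(noisy, noise_level)
        losses.append(np.mean((predicted - noise) ** 2))
    return float(np.mean(losses))


def surprise_gap(denoiser, video, noise_level=0.5, seed=0):
    """Hypothetical score in [-1, 1]: relative loss increase on the reversed clip.

    Both directions reuse the same seed so they see identical noise draws.
    A positive value means the reversed clip is harder to denoise.
    """
    forward = denoising_loss(
        denoiser, video, noise_level, np.random.default_rng(seed)
    )
    backward = denoising_loss(
        denoiser, reverse_time(video), noise_level, np.random.default_rng(seed)
    )
    total = forward + backward
    if total == 0.0:
        return 0.0  # no loss in either direction, so no measurable preference
    return (backward - forward) / total
```

Under these assumptions, a model with no directional preference scores near zero, while a model that finds reversed footage harder to denoise scores above zero. Averaging such a score separately over the causal and non-causal subsets would be one way to probe the Level 2 distinction, though the abstract does not specify how CCI is actually computed.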