CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models

Abstract

Video prediction is increasingly viewed as a path toward generalizable world models, yet it remains unclear whether these systems learn underlying causal structure or merely exploit superficial visual correlations when predicting the future. We introduce CRONOS, an intervention-based benchmark designed to evaluate counterfactual physical consistency: whether a model's predictions of physical events respond appropriately to controlled changes in the visual input, such as variations in scene context, viewpoint, object appearance, and object category. Built in a photorealistic Unreal Engine environment, CRONOS enables controlled, high-fidelity generation of videos across diverse scenes and dynamics. In contrast to previous benchmarks, CRONOS systematically intervenes on four key factors (viewpoint, scene, object category, and object appearance) while keeping the underlying physical event type, such as a collision, occlusion, or fall, fixed. Our evaluation of recent open-source video generators reveals substantial failures in counterfactual physical consistency: prediction quality for the same physical event type is affected by changes in appearance, environment, and, in particular, viewpoint. CRONOS provides a controlled and reproducible testbed for diagnosing how the quality of generated videos changes under different interventions, establishing a concrete target for developing models that perform consistently when multiple conditions change. The dataset and code are available on our project page.