Current World Models Lack a Persistent State Core

Abstract

World models are increasingly regarded as a decisive step toward artificial general intelligence, yet modeling the physical world demands more than rendering convincing frames on demand. It requires an internal world state that keeps evolving over time, decoupled from observation, so that objects persist and events run their course whether or not a camera is watching, much as the moon holds to its orbit when no one is looking. Existing benchmarks are blind to this requirement: they reward surface properties such as fidelity, motion, and camera controllability, but never ask whether a generated world keeps evolving while it is unobserved. We introduce WRBench, the first systematic diagnostic benchmark that treats camera motion as an intervention on observability. It decomposes evaluation into a human-calibrated chain of three questions: whether the camera executes the requested interaction, whether the scene stays continuous and identifiable while in view, and whether a returning target remains consistent with the event that was set in motion. Across 9,600 videos from 23 models spanning four control paradigms, one finding proves stubborn: current systems maintain the observed world as a tracking shot, resuming a returning target in the state at which it was abandoned rather than advancing the event while it went unseen. Because this failure recurs across control paradigms, model families, and increments of scale, robust world-state evolution does not follow from cleaner imagery, tighter control, richer geometric priors, or sheer parameter count. We therefore argue that the stability of the physical state kernel and the consistency of worldlines under viewpoint intervention should become first-class objectives of world-model design, so that a world model captures how the world will unfold rather than how the next frame appears.