Current World Models Lack a Persistent State Core

Abstract

World models are increasingly regarded as a decisive step toward artificial general intelligence, yet modeling the physical world demands more than rendering convincing frames on demand: it requires an internal world state that keeps evolving over time, decoupled from observation, so that objects endure and events run to their conclusions whether or not a camera is watching, much as the moon holds to its orbit when no one is looking. This requirement is a blind spot of existing benchmarks, which reward surface properties such as fidelity, motion, and camera controllability while never asking whether a generated world keeps evolving once it is unobserved. We introduce WRBench, the first systematic diagnostic benchmark that treats camera motion as an intervention on observability and resolves evaluation into a human-calibrated chain that asks whether the camera executes the requested interaction, whether the scene stays continuous and identifiable while in view, and whether a returning target remains consistent with the event that was set in motion. Across 9,600 videos from 23 models spanning four control paradigms, one finding proves stubborn: current systems maintain the observed world as a tracking shot, resuming a returning target in the state at which it was abandoned rather than advancing the event while it went unseen. Because this failure recurs across control paradigms, model families, and increments of scale, robust world-state evolution does not follow from cleaner imagery, tighter control, richer geometric priors, or sheer parameter count. We therefore argue that the stability of the physical state kernel and the consistency of worldlines under viewpoint intervention should become first-class objectives of world-model design, so that a world model captures how the world will unfold rather than how the next frame appears.