Current World Models Lack a Persistent State Core

Abstract

World models are increasingly regarded as a decisive step toward artificial general intelligence, yet modeling the physical world demands more than rendering convincing frames on demand: it requires an internal world state that keeps evolving over time, decoupled from observation, so that objects endure and events run to their conclusions whether or not a camera is watching, much as the moon holds to its orbit when no one is looking. This requirement is a blind spot of existing benchmarks, which reward surface properties such as fidelity, motion, and camera controllability while never asking whether a generated world keeps evolving once it is unobserved. We introduce WRBench, the first systematic diagnostic benchmark that treats camera motion as an intervention on observability and resolves evaluation into a human-calibrated chain that asks whether the camera executes the requested interaction, whether the scene stays continuous and identifiable while in view, and whether a returning target remains consistent with the event that was set in motion. Across 9,600 videos from 23 models spanning four control paradigms, one finding proves stubborn: current systems maintain the observed world as a tracking shot, resuming a returning target in the state at which it was abandoned rather than advancing the event while it went unseen. Because this failure recurs across control paradigms, model families, and increments of scale, robust world-state evolution does not follow from cleaner imagery, tighter control, richer geometric priors, or sheer parameter count. We therefore argue that the stability of the physical state kernel and the consistency of worldlines under viewpoint intervention should become first-class objectives of world-model design, so that a world model captures how the world will unfold rather than how the next frame appears.