Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation

Abstract

Internal world models (WMs) enable agents to understand the world's state and predict transitions, serving as the basis for advanced deliberative reasoning. Recent large Vision-Language Models (VLMs), such as OpenAI o3, GPT-4o, and Gemini, exhibit potential as general-purpose WMs. While recent studies have evaluated specific capabilities such as visual understanding and shown their limitations, a systematic evaluation of VLMs' fundamental WM abilities remains absent. Drawing on comparative psychology and cognitive science, we propose a two-stage framework that assesses Perception (visual, spatial, temporal, quantitative, and motion) and Prediction (mechanistic simulation, transitive inference, and compositional inference) to provide an atomic evaluation of VLMs as WMs. Guided by this framework, we introduce WM-ABench, a large-scale benchmark comprising 23 fine-grained evaluation dimensions across 6 diverse simulated environments with controlled counterfactual simulations. Through 660 experiments on 15 of the latest commercial and open-source VLMs, we find that these models exhibit striking limitations in basic world modeling abilities. For instance, almost all models perform at near-random accuracy when distinguishing motion trajectories. Additionally, they lack disentangled understanding; for example, some models tend to believe blue objects move faster than green ones. Further results and analyses reveal significant gaps between VLMs and human-level world modeling.
