VideoGameBench: Can Vision-Language Models complete popular video games?

Abstract

Vision-language models (VLMs) have achieved strong results on coding and math benchmarks that are challenging for humans, yet their ability to perform tasks that come naturally to humans, such as perception, spatial navigation, and memory management, remains understudied. Real video games are crafted to be intuitive for humans to learn and master by leveraging innate inductive biases, making them an ideal testbed for evaluating such capabilities in VLMs. To this end, we introduce VideoGameBench, a benchmark consisting of 10 popular video games from the 1990s that VLMs directly interact with in real time. VideoGameBench challenges models to complete entire games with access to only raw visual inputs and a high-level description of objectives and controls, a significant departure from existing setups that rely on game-specific scaffolding and auxiliary information. We keep three of the games secret to encourage solutions that generalize to unseen environments. Our experiments show that frontier vision-language models struggle to progress beyond the beginning of each game. We find inference latency to be a major limitation of frontier models in the real-time setting; therefore, we introduce VideoGameBench Lite, a setting where the game pauses while waiting for the model's next action. The best-performing model, Gemini 2.5 Pro, completes only 0.48% of VideoGameBench and 1.6% of VideoGameBench Lite. We hope that the formalization of the human skills mentioned above into this benchmark motivates progress in these research directions.
