VideoGameBench: Can Vision-Language Models complete popular video games?

Abstract

Vision-language models (VLMs) have achieved strong results on coding and math benchmarks that are challenging for humans, yet their ability to perform tasks that come naturally to humans, such as perception, spatial navigation, and memory management, remains understudied. Real video games are crafted to be intuitive for humans to learn and master by leveraging innate inductive biases, making them an ideal testbed for evaluating such capabilities in VLMs. To this end, we introduce VideoGameBench, a benchmark consisting of 10 popular video games from the 1990s that VLMs interact with directly in real time. VideoGameBench challenges models to complete entire games with access to only raw visual inputs and a high-level description of objectives and controls, a significant departure from existing setups that rely on game-specific scaffolding and auxiliary information. We keep three of the games secret to encourage solutions that generalize to unseen environments. Our experiments show that frontier vision-language models struggle to progress beyond the beginning of each game. We find inference latency to be a major limitation of frontier models in the real-time setting; therefore, we introduce VideoGameBench Lite, a setting where the game pauses while waiting for the model's next action. The best-performing model, Gemini 2.5 Pro, completes only 0.48% of VideoGameBench and 1.6% of VideoGameBench Lite. We hope that the formalization of the human skills mentioned above into this benchmark motivates progress in these research directions.
