GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents

Abstract

Towards an embodied generalist for real-world interaction, Multimodal Large Language Model (MLLM) agents still struggle with high latency, sparse feedback, and irreversible mistakes. Video games offer an ideal testbed: they provide rich visual observations and closed-loop interaction, and they demand fine-grained perception, long-horizon planning, and precise control. However, systematic evaluation of these capabilities is currently hindered by heterogeneous action interfaces and heuristic verification. To address this, we introduce GameWorld, a benchmark designed for standardized and verifiable evaluation of MLLMs as generalist game agents in browser environments. We study two game agent interfaces: (i) computer-use agents that directly emit keyboard and mouse controls, and (ii) generalist multimodal agents that act in a semantic action space via deterministic Semantic Action Parsing. GameWorld contains 34 diverse games and 170 tasks, each paired with state-verifiable metrics for outcome-based evaluation. Results across 18 model-interface pairs suggest that even the best-performing agent remains far from human capabilities on video games. Extensive experiments with repeated full-benchmark reruns demonstrate the robustness of the benchmark, while further studies on real-time interaction, context-memory sensitivity, and action validity expose additional challenges for game agents. By offering a standardized, verifiable, and reproducible evaluation framework, GameWorld lays a robust foundation for advancing research on multimodal game agents and beyond. The project page is at https://gameworld-bench.github.io.
