lmgame-Bench: How Good are LLMs at Playing Games?

Abstract

Playing video games requires perception, memory, and planning, exactly the faculties that modern large language model (LLM) agents are expected to master. We study the major challenges in using popular video games to evaluate modern LLMs and find that dropping LLMs directly into games does not yield an effective evaluation, for three reasons: brittle vision perception, prompt sensitivity, and potential data contamination. We introduce lmgame-Bench to turn games into reliable evaluations. lmgame-Bench features a suite of platformer, puzzle, and narrative games delivered through a unified Gym-style API and paired with lightweight perception and memory scaffolds, and it is designed to stabilize prompt variance and remove contamination. Across 13 leading models, we show that lmgame-Bench is challenging while still separating models well. Correlation analysis shows that every game probes a unique blend of capabilities that are often tested in isolation elsewhere. More interestingly, performing reinforcement learning on a single game from lmgame-Bench transfers both to unseen games and to external planning tasks. Our evaluation code is available at https://github.com/lmgame-org/GamingAgent/lmgame-bench.
