

lmgame-Bench: How Good are LLMs at Playing Games?

May 21, 2025
作者: Lanxiang Hu, Mingjia Huo, Yuxuan Zhang, Haoyang Yu, Eric P. Xing, Ion Stoica, Tajana Rosing, Haojian Jin, Hao Zhang
cs.AI

Abstract

Playing video games requires perception, memory, and planning -- exactly the faculties modern large language model (LLM) agents are expected to master. We study the major challenges in using popular video games to evaluate modern LLMs and find that dropping LLMs directly into games does not yield an effective evaluation, for three reasons: brittle vision perception, prompt sensitivity, and potential data contamination. We introduce lmgame-Bench to turn games into reliable evaluations. lmgame-Bench features a suite of platformer, puzzle, and narrative games delivered through a unified Gym-style API and paired with lightweight perception and memory scaffolds, designed to stabilize prompt variance and remove contamination. Across 13 leading models, we show that lmgame-Bench is challenging while still separating models well. Correlation analysis shows that every game probes a unique blend of capabilities that are often tested in isolation elsewhere. More interestingly, reinforcement learning on a single game from lmgame-Bench transfers both to unseen games and to external planning tasks. Our evaluation code is available at https://github.com/lmgame-org/GamingAgent/lmgame-bench.
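
To make the abstract's design concrete, below is a minimal sketch of what a Gym-style evaluation loop with perception and memory scaffolds might look like. It is not the lmgame-Bench API: the perceive() text rendering, the llm_act() stub, and the use of a standard Gymnasium toy environment are all illustrative assumptions standing in for the actual game environments and LLM calls.

```python
# Hypothetical sketch of a Gym-style LLM game loop, assuming a
# Gymnasium-compatible environment. perceive() and llm_act() are
# placeholders, not the actual lmgame-Bench scaffolds.
import gymnasium as gym


def perceive(obs) -> str:
    # Perception scaffold: render the raw observation (frame, grid, state id)
    # as a stable text description, sidestepping brittle vision perception.
    return f"state: {obs}"


def llm_act(description: str, memory: list[str]) -> int:
    # Stand-in for an LLM call that maps the current description, plus a
    # running memory of past states, to a discrete action id.
    memory.append(description)
    return 0  # a real agent would parse the model's chosen action here


env = gym.make("FrozenLake-v1")  # substitute a platformer/puzzle env in practice
obs, info = env.reset(seed=0)
memory: list[str] = []           # memory scaffold persisted across turns
done, total_reward = False, 0.0

while not done:
    action = llm_act(perceive(obs), memory)
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"episode return: {total_reward}")
```

Keeping the loop behind a single reset/step interface is what lets one harness score many games uniformly; the scaffolds live outside the environment, so prompt variance is controlled in one place rather than per game.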

