lmgame-Bench: How Good are LLMs at Playing Games?

May 21, 2025
作者: Lanxiang Hu, Mingjia Huo, Yuxuan Zhang, Haoyang Yu, Eric P. Xing, Ion Stoica, Tajana Rosing, Haojian Jin, Hao Zhang
cs.AI

Abstract

Playing video games requires perception, memory, and planning, exactly the faculties modern large language model (LLM) agents are expected to master. We study the major challenges in using popular video games to evaluate modern LLMs and find that directly dropping LLMs into games does not yield an effective evaluation, for three reasons: brittle visual perception, prompt sensitivity, and potential data contamination. We introduce lmgame-Bench to turn games into reliable evaluations. lmgame-Bench features a suite of platformer, puzzle, and narrative games delivered through a unified Gym-style API and paired with lightweight perception and memory scaffolds; it is designed to stabilize prompt variance and remove contamination. Across 13 leading models, we show that lmgame-Bench is challenging while still separating models well. Correlation analysis shows that every game probes a unique blend of capabilities often tested in isolation elsewhere. More interestingly, reinforcement learning on a single game from lmgame-Bench transfers both to unseen games and to external planning tasks. Our evaluation code is available at https://github.com/lmgame-org/GamingAgent/lmgame-bench.
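
The abstract states that the games are exposed through a unified Gym-style API. As a rough sketch of what that convention implies, the loop below follows standard Gymnasium semantics (reset, step, terminated/truncated flags). The environment id `lmgame/Sokoban-v0` and the `llm_policy` stub are illustrative assumptions, not lmgame-Bench's documented interface; see the linked repository for the actual API.

```python
# Minimal sketch of a Gym-style evaluation loop, using standard
# Gymnasium conventions. The environment id and llm_policy below are
# hypothetical placeholders, not lmgame-Bench's documented API.
import gymnasium as gym

def llm_policy(obs, env):
    # Stand-in for the LLM agent: in lmgame-Bench the model would receive
    # the (perception/memory-scaffolded) observation and return a legal
    # action. Here we sample randomly so the sketch runs end to end.
    return env.action_space.sample()

env = gym.make("lmgame/Sokoban-v0")  # hypothetical id in Gym naming style
obs, info = env.reset(seed=0)

total_reward, terminated, truncated = 0.0, False, False
while not (terminated or truncated):
    action = llm_policy(obs, env)
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward

print(f"Episode return: {total_reward}")
env.close()
```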
