Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games
June 4, 2025
Authors: Dongmin Park, Minkyu Kim, Beongjun Choi, Junhyuck Kim, Keon Lee, Jonghyun Lee, Inkyu Park, Byeong-Uk Lee, Jaeyoung Hwang, Jaewoo Ahn, Ameya S. Mahabaleshwarkar, Bilal Kartal, Pritam Biswas, Yoshi Suhara, Kangwook Lee, Jaewoong Cho
cs.AI
Abstract
Large Language Model (LLM) agents are reshaping the game industry,
particularly with more intelligent and human-preferable game characters.
However, existing game benchmarks fall short of practical needs: they lack
evaluations of diverse LLM capabilities across various game genres, studies of
agentic modules crucial for complex gameplay, and fine-tuning datasets for
aligning pre-trained LLMs into gaming agents. To fill these gaps, we present
Orak, a foundational benchmark designed to train and evaluate
LLM agents across diverse real-world video games. Unlike existing benchmarks,
Orak includes 12 popular video games spanning all major genres, enabling
comprehensive studies of LLM capabilities and agentic modules essential for
intricate game scenarios. To support consistent evaluation of LLMs, we
introduce a plug-and-play interface based on Model Context Protocol (MCP) that
enables LLMs to seamlessly connect with games and manipulate agentic modules.
Additionally, we propose a fine-tuning dataset, consisting of LLM gameplay
trajectories across diverse game genres. Orak offers a comprehensive evaluation
framework, encompassing general game score leaderboards, LLM battle arenas, and
in-depth analyses of visual input state, agentic strategies, and fine-tuning
effects, establishing a foundation towards building generic gaming agents. Code
is available at https://github.com/krafton-ai/Orak.
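
The abstract does not spell out what the MCP-based plug-and-play interface looks like in code. As a rough, illustrative sketch only (not Orak's actual interface), a per-game tool server could be exposed through the official MCP Python SDK's `FastMCP` helper; the `get_observation` and `send_action` tools and the toy in-memory game state below are hypothetical stand-ins for a real game hook:

```python
# Illustrative sketch only -- not Orak's actual interface.
# Assumes the official MCP Python SDK (pip install "mcp[cli]"); the toy state dict
# below stands in for a connection to a running game process.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("game-env")                      # one MCP server per game environment
_state = {"turn": 0, "last_action": None}      # hypothetical stand-in for game state

@mcp.tool()
def get_observation() -> str:
    """Return the current game state as text the LLM agent can reason over."""
    return f"turn={_state['turn']}, last_action={_state['last_action']}"

@mcp.tool()
def send_action(action: str) -> str:
    """Apply an agent-chosen action to the game and report the resulting state."""
    _state["turn"] += 1
    _state["last_action"] = action
    return get_observation()

if __name__ == "__main__":
    mcp.run()  # any MCP-capable LLM client can now connect and call these tools
```

Under this kind of setup, swapping games or agentic modules amounts to pointing the LLM client at a different tool server, which is the plug-and-play property the abstract describes.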