lmgame-Bench: How Good are LLMs at Playing Games?

Abstract

Playing video games requires perception, memory, and planning, exactly the faculties modern large language model (LLM) agents are expected to master. We study the major challenges in using popular video games to evaluate modern LLMs and find that dropping LLMs directly into games does not yield an effective evaluation, for three reasons: brittle vision perception, prompt sensitivity, and potential data contamination. We introduce lmgame-Bench to turn games into reliable evaluations. lmgame-Bench features a suite of platformer, puzzle, and narrative games delivered through a unified Gym-style API and paired with lightweight perception and memory scaffolds, and it is designed to stabilize prompt variance and remove contamination. Across 13 leading models, we show that lmgame-Bench is challenging while still separating models well. Correlation analysis shows that every game probes a unique blend of capabilities often tested in isolation elsewhere. More interestingly, performing reinforcement learning on a single game from lmgame-Bench transfers both to unseen games and to external planning tasks. Our evaluation code is available at https://github.com/lmgame-org/GamingAgent/lmgame-bench.
