BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games

Abstract

Large Language Models (LLMs) and Vision Language Models (VLMs) possess extensive knowledge and exhibit promising reasoning abilities; however, they still struggle to perform well in complex, dynamic environments. Real-world tasks require handling intricate interactions, advanced spatial reasoning, long-term planning, and continuous exploration of new strategies, and we lack effective methodologies for comprehensively evaluating these capabilities. To address this gap, we introduce BALROG, a novel benchmark designed to assess the agentic capabilities of LLMs and VLMs through a diverse set of challenging games. Our benchmark incorporates a range of existing reinforcement learning environments of varying difficulty, from tasks that non-expert humans can solve in seconds to extremely challenging ones that may take years to master (e.g., the NetHack Learning Environment). We devise fine-grained metrics to measure performance and conduct an extensive evaluation of several popular open-source and closed-source LLMs and VLMs. Our findings indicate that while current models achieve partial success in the easier games, they struggle significantly with more challenging tasks. Notably, we observe severe deficiencies in vision-based decision-making: models perform worse when visual representations of the environments are provided. We release BALROG as an open and user-friendly benchmark to facilitate future research and development in the agentic community.

