TextQuests: How Good are LLMs at Text-Based Video Games?

Abstract

Evaluating AI agents within complex, interactive environments that mirror real-world challenges is critical for understanding their practical capabilities. While existing agent benchmarks effectively assess skills like tool use or performance on structured tasks, they often do not fully capture an agent's ability to operate autonomously in exploratory environments that demand sustained, self-directed reasoning over a long and growing context. To spur the development of agents capable of more robust intrinsic reasoning over long horizons, we introduce TextQuests, a benchmark based on the Infocom suite of interactive fiction games. These text-based adventures, which can take human players over 30 hours and require hundreds of precise actions to solve, serve as an effective proxy for evaluating AI agents on focused, stateful tasks. The benchmark is specifically designed to assess an LLM agent's capacity for self-contained problem-solving by precluding the use of external tools, thereby focusing on intrinsic long-context reasoning capabilities in an exploratory environment characterized by the need for trial-and-error learning and sustained problem-solving within a single interactive session. We release TextQuests at https://textquests.ai.
