

TextQuests: How Good are LLMs at Text-Based Video Games?

July 31, 2025
Authors: Long Phan, Mantas Mazeika, Andy Zou, Dan Hendrycks
cs.AI

Abstract

Evaluating AI agents within complex, interactive environments that mirror real-world challenges is critical for understanding their practical capabilities. While existing agent benchmarks effectively assess skills like tool use or performance on structured tasks, they often do not fully capture an agent's ability to operate autonomously in exploratory environments that demand sustained, self-directed reasoning over a long and growing context. To spur the development of agents capable of more robust intrinsic reasoning over long horizons, we introduce TextQuests, a benchmark based on the Infocom suite of interactive fiction games. These text-based adventures, which can take human players over 30 hours and require hundreds of precise actions to solve, serve as an effective proxy for evaluating AI agents on focused, stateful tasks. The benchmark is specifically designed to assess an LLM agent's capacity for self-contained problem-solving by precluding the use of external tools, thereby focusing on intrinsic long-context reasoning capabilities in an exploratory environment characterized by the need for trial-and-error learning and sustained problem-solving within a single interactive session. We release TextQuests at https://textquests.ai.
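The single-session, trial-and-error protocol described above can be pictured with a short sketch: the agent repeatedly issues text commands, receives game observations, and carries the entire growing transcript forward as its only memory. The sketch below is a minimal illustration, not the released TextQuests harness; `ZMachineEnv` and `query_llm` are hypothetical placeholders for an interactive fiction environment wrapper and an LLM call, and the progress-based scoring is an assumption for illustration.

```python
# Minimal sketch of a single-session agent loop on a text-based adventure game.
# ZMachineEnv and query_llm are hypothetical placeholders, not the TextQuests API.

def play_episode(game_file: str, max_steps: int = 500) -> int:
    env = ZMachineEnv(game_file)       # hypothetical Z-machine game wrapper
    observation = env.reset()          # opening text of the game
    history = [("observation", observation)]

    for _ in range(max_steps):
        # The full transcript is passed back to the model each turn, so the
        # context grows over the session; no external tools are available.
        prompt = "\n".join(f"{role}: {text}" for role, text in history)
        command = query_llm(prompt)    # e.g. "open mailbox", "go north"

        observation, score, done = env.step(command)
        history.append(("action", command))
        history.append(("observation", observation))
        if done:
            break

    return score                       # progress-based score for the run
```

In this framing, the benchmark stresses intrinsic long-context reasoning: the agent must recall earlier observations, recover from failed commands, and plan multi-step solutions entirely within one interactive session.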