TextQuests: How Good are LLMs at Text-Based Video Games?
July 31, 2025
Authors: Long Phan, Mantas Mazeika, Andy Zou, Dan Hendrycks
cs.AI
Abstract
Evaluating AI agents within complex, interactive environments that mirror
real-world challenges is critical for understanding their practical
capabilities. While existing agent benchmarks effectively assess skills like
tool use or performance on structured tasks, they often do not fully capture an
agent's ability to operate autonomously in exploratory environments that demand
sustained, self-directed reasoning over a long and growing context. To spur the
development of agents capable of more robust intrinsic reasoning over long
horizons, we introduce TextQuests, a benchmark based on the Infocom suite of
interactive fiction games. These text-based adventures, which can take human
players over 30 hours and require hundreds of precise actions to solve, serve
as an effective proxy for evaluating AI agents on focused, stateful tasks. The
benchmark is specifically designed to assess an LLM agent's capacity for
self-contained problem-solving by precluding the use of external tools, thereby
focusing on intrinsic long-context reasoning capabilities in an exploratory
environment characterized by the need for trial-and-error learning and
sustained problem-solving within a single interactive session. We release
TextQuests at https://textquests.ai.
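
To make the evaluation setup concrete, here is a minimal sketch of the single-session agent loop the abstract describes: the agent issues text commands, and every observation is appended to one ever-growing transcript that it must reason over without external tools. The toy environment and the `choose_action` policy below are hypothetical stand-ins for illustration only; the actual games and harness interface are defined by the release at https://textquests.ai, and a real run would replace `choose_action` with an LLM call conditioned on the full transcript.

```python
# Sketch of a single-session agent loop over an interactive fiction game.
# ToyAdventure and choose_action are hypothetical placeholders, not the
# TextQuests API.

from typing import Tuple

class ToyAdventure:
    """Two-step stand-in for an Infocom-style interactive fiction game."""
    def __init__(self) -> None:
        self.opened = False

    def reset(self) -> str:
        self.opened = False
        return "You stand before a small mailbox."

    def step(self, command: str) -> Tuple[str, int, bool]:
        """Apply one text command; return (observation, score_delta, done)."""
        if command == "open mailbox" and not self.opened:
            self.opened = True
            return "Inside the mailbox is a leaflet.", 1, False
        if command == "take leaflet" and self.opened:
            return "Taken. You win!", 1, True
        return "Nothing happens.", 0, False

def choose_action(transcript: str) -> str:
    """Placeholder policy; an LLM agent would condition on the full transcript."""
    return "take leaflet" if "leaflet" in transcript else "open mailbox"

def play_session(env: ToyAdventure, max_steps: int = 100) -> int:
    # The whole session accumulates into one growing context: the agent
    # must reason over everything seen so far, with no external tools.
    transcript = env.reset()
    score = 0
    for _ in range(max_steps):
        action = choose_action(transcript)
        obs, delta, done = env.step(action)
        score += delta
        transcript += f"\n> {action}\n{obs}"  # append the turn to the context
        if done:
            break
    return score

if __name__ == "__main__":
    print("final score:", play_session(ToyAdventure()))  # -> final score: 2
```

The key design point this illustrates is statefulness: because the transcript is never reset within a session, solving a game that takes hundreds of actions forces the model to retain and exploit information discovered through earlier trial and error, which is exactly the intrinsic long-context reasoning the benchmark targets.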