TravelPlanner: A Benchmark for Real-World Planning with Language Agents

February 2, 2024
Authors: Jian Xie, Kai Zhang, Jiangjie Chen, Tinghui Zhu, Renze Lou, Yuandong Tian, Yanghua Xiao, Yu Su
cs.AI

Abstract

Planning has been a core pursuit of artificial intelligence since its conception, but earlier AI agents mostly focused on constrained settings because many of the cognitive substrates necessary for human-level planning were lacking. Recently, language agents powered by large language models (LLMs) have shown interesting capabilities such as tool use and reasoning. Are these language agents capable of planning in more complex settings that are out of the reach of prior AI agents? To advance this investigation, we propose TravelPlanner, a new planning benchmark that focuses on travel planning, a common real-world planning scenario. It provides a rich sandbox environment, various tools for accessing nearly four million data records, and 1,225 meticulously curated planning intents and reference plans. Comprehensive evaluations show that current language agents are not yet capable of handling such complex planning tasks: even GPT-4 achieves a success rate of only 0.6%. Language agents struggle to stay on task, use the right tools to collect information, or keep track of multiple constraints. However, we note that the mere possibility for language agents to tackle such a complex problem is in itself non-trivial progress. TravelPlanner provides a challenging yet meaningful testbed for future language agents.