TravelPlanner: A Benchmark for Real-World Planning with Language Agents
February 2, 2024
Authors: Jian Xie, Kai Zhang, Jiangjie Chen, Tinghui Zhu, Renze Lou, Yuandong Tian, Yanghua Xiao, Yu Su
cs.AI
Abstract
Planning has been part of the core pursuit for artificial intelligence since
its conception, but earlier AI agents mostly focused on constrained settings
because many of the cognitive substrates necessary for human-level planning
have been lacking. Recently, language agents powered by large language models
(LLMs) have shown interesting capabilities such as tool use and reasoning. Are
these language agents capable of planning in more complex settings that are out
of the reach of prior AI agents? To advance this investigation, we propose
TravelPlanner, a new planning benchmark that focuses on travel planning, a
common real-world planning scenario. It provides a rich sandbox environment,
various tools for accessing nearly four million data records, and 1,225
meticulously curated planning intents and reference plans. Comprehensive
evaluations show that the current language agents are not yet capable of
handling such complex planning tasks: even GPT-4 only achieves a success rate of
0.6%. Language agents struggle to stay on task, use the right tools to collect
information, or keep track of multiple constraints. However, we note that the
mere possibility for language agents to tackle such a complex problem is in
itself non-trivial progress. TravelPlanner provides a challenging yet
meaningful testbed for future language agents.