TravelPlanner: A Benchmark for Real-World Planning with Language Agents

Abstract

Planning has been part of the core pursuit for artificial intelligence since its conception, but earlier AI agents mostly focused on constrained settings because many of the cognitive substrates necessary for human-level planning have been lacking. Recently, language agents powered by large language models (LLMs) have shown interesting capabilities such as tool use and reasoning. Are these language agents capable of planning in more complex settings that are out of the reach of prior AI agents? To advance this investigation, we propose TravelPlanner, a new planning benchmark that focuses on travel planning, a common real-world planning scenario. It provides a rich sandbox environment, various tools for accessing nearly four million data records, and 1,225 meticulously curated planning intents and reference plans. Comprehensive evaluations show that current language agents are not yet capable of handling such complex planning tasks: even GPT-4 achieves a success rate of only 0.6%. Language agents struggle to stay on task, use the right tools to collect information, or keep track of multiple constraints. However, we note that the mere possibility for language agents to tackle such a complex problem is in itself non-trivial progress. TravelPlanner provides a challenging yet meaningful testbed for future language agents.
