NATURAL PLAN: Benchmarking LLMs on Natural Language Planning

Abstract

We introduce NATURAL PLAN, a realistic planning benchmark in natural language containing 3 key tasks: Trip Planning, Meeting Planning, and Calendar Scheduling. We focus our evaluation on the planning capabilities of LLMs with full information on the task, by providing outputs from tools such as Google Flights, Google Maps, and Google Calendar as context to the models. This eliminates the need for a tool-use environment when evaluating LLMs on planning. We observe that NATURAL PLAN is a challenging benchmark for state-of-the-art models. For example, in Trip Planning, GPT-4 and Gemini 1.5 Pro achieve solve rates of only 31.1% and 34.8%, respectively. We find that model performance drops drastically as the complexity of the problem increases: all models perform below 5% when there are 10 cities, highlighting a significant gap in natural language planning for state-of-the-art LLMs. We also conduct extensive ablation studies on NATURAL PLAN to shed further light on the (in)effectiveness of approaches such as self-correction, few-shot generalization, and in-context planning with long contexts for improving LLM planning.
