
NATURAL PLAN: Benchmarking LLMs on Natural Language Planning

June 6, 2024
Authors: Huaixiu Steven Zheng, Swaroop Mishra, Hugh Zhang, Xinyun Chen, Minmin Chen, Azade Nova, Le Hou, Heng-Tze Cheng, Quoc V. Le, Ed H. Chi, Denny Zhou
cs.AI

Abstract

We introduce NATURAL PLAN, a realistic planning benchmark in natural language containing 3 key tasks: Trip Planning, Meeting Planning, and Calendar Scheduling. We focus our evaluation on the planning capabilities of LLMs with full information on the task, by providing outputs from tools such as Google Flights, Google Maps, and Google Calendar as context to the models. This eliminates the need for a tool-use environment when evaluating LLMs on planning. We observe that NATURAL PLAN is a challenging benchmark for state-of-the-art models. For example, in Trip Planning, GPT-4 and Gemini 1.5 Pro achieve solve rates of only 31.1% and 34.8%, respectively. We find that model performance drops drastically as the complexity of the problem increases: all models perform below 5% when there are 10 cities, highlighting a significant gap in natural language planning for SoTA LLMs. We also conduct extensive ablation studies on NATURAL PLAN to shed further light on the (in)effectiveness of approaches such as self-correction, few-shot generalization, and in-context planning with long contexts for improving LLM planning.
