NATURAL PLAN: Benchmarking LLMs on Natural Language Planning

Abstract

We introduce NATURAL PLAN, a realistic planning benchmark in natural language containing three key tasks: Trip Planning, Meeting Planning, and Calendar Scheduling. We focus our evaluation on the planning capabilities of LLMs with full information on the task, by providing outputs from tools such as Google Flights, Google Maps, and Google Calendar as context to the models. This eliminates the need for a tool-use environment when evaluating LLMs on planning. We observe that NATURAL PLAN is a challenging benchmark for state-of-the-art models. For example, in Trip Planning, GPT-4 and Gemini 1.5 Pro achieve solve rates of only 31.1% and 34.8%, respectively. We find that model performance drops drastically as the complexity of the problem increases: all models perform below 5% when there are 10 cities, highlighting a significant gap in natural language planning for state-of-the-art LLMs. We also conduct extensive ablation studies on NATURAL PLAN to shed further light on the (in)effectiveness of approaches such as self-correction, few-shot generalization, and in-context planning with long contexts for improving LLM planning.
