Planetarium: A Rigorous Benchmark for Translating Text to Structured Planning Languages
July 3, 2024
Authors: Max Zuo, Francisco Piedrahita Velez, Xiaochen Li, Michael L. Littman, Stephen H. Bach
cs.AI
Abstract
Many recent works have explored using language models for planning problems.
One line of research focuses on translating natural language descriptions of
planning tasks into structured planning languages, such as the Planning Domain
Definition Language (PDDL). While this approach is promising, accurately
measuring the quality of generated PDDL code continues to pose significant
challenges. First, generated PDDL code is typically evaluated using planning
validators that check whether the problem can be solved with a planner. This
method is insufficient because a language model might generate valid PDDL code
that does not align with the natural language description of the task. Second,
existing evaluation sets often have natural language descriptions of the
planning task that closely resemble the ground truth PDDL, reducing the
challenge of the task. To bridge this gap, we introduce Planetarium, a
benchmark designed to evaluate language models' ability to generate PDDL code
from natural language descriptions of planning tasks. We begin by creating a
PDDL equivalence algorithm that rigorously evaluates the correctness of PDDL
code generated by language models by flexibly comparing it against a ground
truth PDDL. Then, we present a dataset of 132,037 text-to-PDDL pairs across
13 different tasks, with varying levels of difficulty. Finally, we evaluate
several API-access and open-weight language models, revealing the
complexity of this task. For example, 87.6% of the PDDL problem descriptions
generated by GPT-4o are syntactically parseable, 82.2% are valid, solvable
problems, but only 35.1% are semantically correct, highlighting the need for a more
rigorous benchmark for this problem.
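
To make the failure mode concrete: a model can emit PDDL that parses and even solves, yet describes a different problem than the text, for example by placing the right predicates on the wrong objects. The Python sketch below is a toy illustration of what a "flexible comparison" must tolerate; it is not Planetarium's equivalence algorithm. It treats a goal as a set of ground atoms and brute-forces whether two goals match under some renaming of objects; the goal strings, function names, and the flat-conjunction parser are all illustrative assumptions.

```python
from itertools import permutations

def parse_atoms(goal: str) -> frozenset:
    """Parse a flat conjunctive goal like '(and (on a b) (ontable c))'
    into a frozenset of (predicate, *args) tuples. Toy parser: assumes
    the goal is exactly one (and ...) of ground atoms."""
    inner = goal.strip()[len("(and"):-1]   # drop '(and' and the final ')'
    atoms = set()
    for chunk in inner.split(")"):
        tokens = chunk.replace("(", " ").split()
        if tokens:
            atoms.add(tuple(tokens))
    return frozenset(atoms)

def equivalent_up_to_renaming(goal_a: str, goal_b: str, objects: list) -> bool:
    """Do the two goals describe the same state under some bijective
    renaming of objects? Brute force over all permutations, so only
    feasible for tiny problems; it stands in for the benchmark's far
    more efficient equivalence check."""
    a, b = parse_atoms(goal_a), parse_atoms(goal_b)
    for perm in permutations(objects):
        rename = dict(zip(objects, perm))
        renamed = frozenset(
            (atom[0], *(rename.get(t, t) for t in atom[1:]))
            for atom in a
        )
        if renamed == b:
            return True
    return False

# Same three-block tower, written with the blocks named in opposite order:
ground_truth = "(and (on a b) (on b c) (ontable c))"
generated    = "(and (on c b) (on b a) (ontable a))"
print(equivalent_up_to_renaming(ground_truth, generated, ["a", "b", "c"]))  # True

# A goal that is valid and solvable but means something else entirely:
wrong = "(and (on a b) (ontable b) (ontable c))"
print(equivalent_up_to_renaming(ground_truth, wrong, ["a", "b", "c"]))      # False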
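The three GPT-4o figures above (87.6%, 82.2%, 35.1%) read as an evaluation funnel in which each check is stricter than the last, which is why the percentages decrease. Below is a minimal sketch of such a funnel, assuming the caller supplies the three checks: `parse_fn`, `solvable_fn`, and `equivalent_fn` are hypothetical stand-ins for a PDDL parser, a planner call, and the equivalence algorithm, not names from the paper.

```python
def evaluation_funnel(pairs, parse_fn, solvable_fn, equivalent_fn):
    """Score (generated, ground_truth) PDDL pairs at three increasingly
    strict levels: parseable -> valid & solvable -> semantically correct.
    The three callables are hypothetical stand-ins for a real PDDL
    parser, planner, and equivalence check."""
    parseable = solvable = correct = 0
    for generated, truth in pairs:
        problem = parse_fn(generated)       # expected to return None on a syntax error
        if problem is None:
            continue
        parseable += 1
        if not solvable_fn(problem):        # can a planner find any plan?
            continue
        solvable += 1
        if equivalent_fn(problem, parse_fn(truth)):
            correct += 1                    # same meaning as the ground truth
    n = len(pairs)
    return parseable / n, solvable / n, correct / n
```

Plugging in a real parser and planner, plus an equivalence check along the lines of the earlier sketch, would approximate the three-level scoring the abstract reports.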