Planetarium: A Rigorous Benchmark for Translating Text to Structured Planning Languages
July 3, 2024
Authors: Max Zuo, Francisco Piedrahita Velez, Xiaochen Li, Michael L. Littman, Stephen H. Bach
cs.AI
Abstract
Many recent works have explored using language models for planning problems.
One line of research focuses on translating natural language descriptions of
planning tasks into structured planning languages, such as the Planning Domain
Definition Language (PDDL). While this approach is promising, accurately
measuring the quality of generated PDDL code continues to pose significant
challenges. First, generated PDDL code is typically evaluated using planning
validators that check whether the problem can be solved with a planner. This
method is insufficient because a language model might generate valid PDDL code
that does not align with the natural language description of the task. Second,
existing evaluation sets often have natural language descriptions of the
planning task that closely resemble the ground truth PDDL, reducing the
challenge of the task. To bridge this gap, we introduce Planetarium, a
benchmark designed to evaluate language models' ability to generate PDDL code
from natural language descriptions of planning tasks. We begin by creating a
PDDL equivalence algorithm that rigorously evaluates the correctness of PDDL
code generated by language models by flexibly comparing it against a ground
truth PDDL. Then, we present a dataset of 132,037 text-to-PDDL pairs across
13 different tasks, with varying levels of difficulty. Finally, we evaluate
several API-access and open-weight language models, revealing this task's
complexity. For example, 87.6% of the PDDL problem descriptions generated by
GPT-4o are syntactically parseable and 82.2% are valid, solvable problems,
but only 35.1% are semantically correct, highlighting the need for a more
rigorous benchmark for this problem.
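
To make the text-to-PDDL translation task concrete, here is a minimal illustrative pair. The blocksworld domain and predicate names below follow one common formulation and are used purely for illustration; they are not drawn from the benchmark itself.

Natural language description: "There are two blocks, a and b, both on the table. Stack block a on top of block b."

A corresponding PDDL problem:

    (define (problem stack-two)
      (:domain blocksworld)
      (:objects a b)
      ;; initial state: both blocks on the table, nothing held
      (:init (on-table a) (on-table b) (clear a) (clear b) (arm-empty))
      ;; goal: block a sits on block b
      (:goal (on a b)))

A model that instead emitted (:goal (on b a)) would still produce parseable PDDL that a planner can solve, yet it would be semantically wrong for this description. This is exactly the failure mode that planner-based validation misses and that comparison against the ground-truth PDDL is meant to catch.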