Planetarium: A Rigorous Benchmark for Translating Text to Structured Planning Languages

Abstract

Many recent works have explored using language models for planning problems. One line of research focuses on translating natural language descriptions of planning tasks into structured planning languages, such as the Planning Domain Definition Language (PDDL). While this approach is promising, accurately measuring the quality of generated PDDL code continues to pose significant challenges. First, generated PDDL code is typically evaluated using planning validators that check whether the problem can be solved with a planner. This method is insufficient because a language model might generate valid PDDL code that does not align with the natural language description of the task. Second, existing evaluation sets often have natural language descriptions of the planning task that closely resemble the ground truth PDDL, reducing the challenge of the task. To bridge this gap, we introduce Planetarium, a benchmark designed to evaluate language models' ability to generate PDDL code from natural language descriptions of planning tasks. We begin by creating a PDDL equivalence algorithm that rigorously evaluates the correctness of PDDL code generated by language models by flexibly comparing it against a ground truth PDDL. Then, we present a dataset of 132,037 text-to-PDDL pairs across 13 different tasks, with varying levels of difficulty. Finally, we evaluate several API-access and open-weight language models to reveal this task's complexity. For example, 87.6% of the PDDL problem descriptions generated by GPT-4o are syntactically parseable and 82.2% are valid, solvable problems, but only 35.1% are semantically correct, highlighting the need for a more rigorous benchmark for this problem.
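
To illustrate why solvability alone is a weak check, the following is a minimal sketch of an equivalence test on toy problems. It is not the equivalence algorithm introduced in this work. The problem representation (a dict with `objects`, `init`, and `goal`, where atoms are tuples) and the function names `rename` and `equivalent` are assumptions made for illustration, and the brute-force search over object renamings is only practical for very small problems.

```python
from itertools import permutations

# Toy representation (hypothetical): ground atoms are tuples such as
# ("on", "a", "b"); a problem is a dict with "objects", "init", and "goal".

def rename(atoms, mapping):
    """Apply an object renaming to a collection of ground atoms."""
    return frozenset((pred, *(mapping[a] for a in args)) for pred, *args in atoms)

def equivalent(p, q):
    """True if some bijection between objects maps p's init and goal onto q's."""
    if len(p["objects"]) != len(q["objects"]):
        return False
    src = sorted(p["objects"])
    for perm in permutations(sorted(q["objects"])):
        mapping = dict(zip(src, perm))
        if (rename(p["init"], mapping) == frozenset(q["init"])
                and rename(p["goal"], mapping) == frozenset(q["goal"])):
            return True
    return False

# Ground truth: two blocks on the table, goal is to stack a on b.
truth = {
    "objects": {"a", "b"},
    "init": {("on-table", "a"), ("on-table", "b"), ("clear", "a"), ("clear", "b")},
    "goal": {("on", "a", "b")},
}
# Same problem with different object names: should count as correct.
renamed = {
    "objects": {"x", "y"},
    "init": {("on-table", "x"), ("on-table", "y"), ("clear", "x"), ("clear", "y")},
    "goal": {("on", "y", "x")},
}
# A well-formed problem whose goal already holds in the initial state,
# so a planner would accept it, yet it does not describe the same task.
wrong = {
    "objects": {"a", "b"},
    "init": {("on-table", "a"), ("on-table", "b"), ("clear", "a"), ("clear", "b")},
    "goal": {("on-table", "a"), ("on-table", "b")},
}

print(equivalent(truth, renamed))
print(equivalent(truth, wrong))
```

In this sketch the renamed problem is accepted even though its text differs from the ground truth, while the trivially solvable problem is rejected, which is the distinction a solvability-only check cannot make.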
