Planetarium: A Rigorous Benchmark for Translating Text to Structured Planning Languages

Abstract

Many recent works have explored using language models for planning problems. One line of research focuses on translating natural language descriptions of planning tasks into structured planning languages, such as the Planning Domain Definition Language (PDDL). While this approach is promising, accurately measuring the quality of generated PDDL code remains a significant challenge. First, generated PDDL code is typically evaluated using planning validators that check whether the problem can be solved with a planner. This method is insufficient because a language model might generate valid PDDL code that does not align with the natural language description of the task. Second, in existing evaluation sets the natural language descriptions of the planning task often closely resemble the ground truth PDDL, reducing the challenge of the task. To bridge this gap, we introduce Planetarium, a benchmark designed to evaluate language models' ability to generate PDDL code from natural language descriptions of planning tasks. We begin by creating a PDDL equivalence algorithm that rigorously evaluates the correctness of PDDL code generated by language models by flexibly comparing it against a ground truth PDDL. Then, we present a dataset of 132,037 text-to-PDDL pairs across 13 different tasks, with varying levels of difficulty. Finally, we evaluate several API-access and open-weight language models, and the results reveal this task's complexity. For example, 87.6% of the PDDL problem descriptions generated by GPT-4o are syntactically parseable and 82.2% are valid, solvable problems, but only 35.1% are semantically correct, highlighting the need for a more rigorous benchmark for this problem.
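
To make the gap between "valid and solvable" and "semantically correct" concrete, the following toy sketch compares two problems by their initial and goal states. It is not the equivalence algorithm from the paper, which the abstract does not describe in detail. It assumes, purely for illustration, that a "flexible" comparison should at least ignore object names, and it uses a hypothetical tuple representation of a problem (objects, init atoms, goal atoms) together with a brute-force search over renamings.

```python
from itertools import permutations

# Hypothetical toy representation: a problem is (objects, init, goal),
# where init and goal are sets of ground atoms such as ("on", "a", "b").

def rename(atoms, mapping):
    """Apply an object-renaming map to a collection of ground atoms."""
    return frozenset((pred, *(mapping[arg] for arg in args)) for pred, *args in atoms)

def equivalent(p, q):
    """Return True if some bijection between the objects of p and q maps
    p's init and goal exactly onto q's. Brute force, so exponential in the
    number of objects; intended only as an illustration."""
    objs_p, init_p, goal_p = p
    objs_q, init_q, goal_q = q
    if len(objs_p) != len(objs_q):
        return False
    for perm in permutations(objs_q):
        mapping = dict(zip(objs_p, perm))
        if (rename(init_p, mapping) == frozenset(init_q)
                and rename(goal_p, mapping) == frozenset(goal_q)):
            return True
    return False

# Reference problem: two blocks on the table, goal is to stack a on b.
init = {("ontable", "a"), ("ontable", "b"), ("clear", "a"), ("clear", "b")}
reference = (["a", "b"], init, {("on", "a", "b")})

# Same problem with different object names: should count as correct.
renamed = (
    ["x", "y"],
    {("ontable", "x"), ("ontable", "y"), ("clear", "x"), ("clear", "y")},
    {("on", "x", "y")},
)

# Well-formed and trivially solvable, but the goal does not match the task.
wrong_goal = (["a", "b"], init, {("ontable", "a")})

print(equivalent(reference, renamed))     # True
print(equivalent(reference, wrong_goal))  # False
```

The third problem would pass a check that only asks whether a planner can solve it, since its goal already holds in the initial state, yet it does not describe the intended task. This is the kind of error that a solvability check alone cannot detect.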
