PlanningBench: Scalable and Verifiable Planning Data Generation for Evaluating and Training Large Language Models

Abstract

Planning is a fundamental capability for large language models (LLMs) because complex tasks require models to coordinate goals, constraints, resources, and long-term consequences into executable and verifiable solutions. Existing planning benchmarks, however, usually treat planning data as fixed collections of instances rather than controllable generation targets. This limits scenario coverage, ties difficulty to surface-level proxies rather than structural sources, and offers limited support for scalable generation, automatic verification, or planning-oriented training. We introduce PlanningBench, a framework for generating scalable, diverse, and verifiable planning data for both evaluation and training. PlanningBench starts from real planning scenarios and abstracts practical workflows into a structured taxonomy of more than 30 task types, subtasks, constraint families, and difficulty factors. Guided by this taxonomy, a constraint-driven synthesis pipeline instantiates self-contained planning problems with adaptive difficulty control, quality filtering, and instance-level verification checklists. This shifts planning data construction from fixed benchmark collection to controllable generation while preserving realistic task grounding. We use PlanningBench to evaluate open-source and closed-source frontier LLMs and find that current models still struggle to produce complete solutions under coupled constraints. Beyond evaluation, reinforcement learning on verified PlanningBench data improves performance on unseen planning benchmarks and broader instruction-following tasks. Further analysis suggests that determinate or well-specified optimal solutions provide clearer reward signals and more stable training dynamics. Overall, PlanningBench provides a controllable source of planning data for diagnosing and improving generalizable planning abilities in LLMs.
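To make the idea of an instance-level verification checklist concrete, the following is a minimal sketch and not the PlanningBench implementation. It assumes a hypothetical toy scheduling instance (task durations, precedence pairs, and a time horizon) and hypothetical names (`Check`, `build_checklist`, `reward`). Each checklist item is a programmatic test on a candidate plan, and a plan earns a binary reward only if every item passes, which is one simple way a checklist could supply a clear reward signal.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical plan format: task name -> integer start slot.
Plan = dict[str, int]


@dataclass(frozen=True)
class Check:
    """One checklist item: a description plus a predicate on a plan."""
    description: str
    passes: Callable[[Plan], bool]


def build_checklist(
    durations: dict[str, int],
    precedences: list[tuple[str, str]],
    horizon: int,
) -> list[Check]:
    """Build the checks for one toy instance (illustrative only)."""
    checks = [
        Check(
            "every task is scheduled and no unknown task appears",
            lambda p: set(p) == set(durations),
        ),
        Check(
            f"every task starts at slot 0 or later and ends by slot {horizon}",
            lambda p: all(
                t in durations and p[t] >= 0 and p[t] + durations[t] <= horizon
                for t in p
            ),
        ),
    ]
    for before, after in precedences:
        # Default arguments bind the loop variables at definition time.
        checks.append(
            Check(
                f"{before} finishes before {after} starts",
                lambda p, b=before, a=after: (
                    b in p and a in p and p[b] + durations[b] <= p[a]
                ),
            )
        )
    return checks


def verify(plan: Plan, checklist: list[Check]) -> dict[str, bool]:
    """Return the pass/fail result of each checklist item."""
    return {c.description: c.passes(plan) for c in checklist}


def reward(plan: Plan, checklist: list[Check]) -> float:
    """Binary reward: 1.0 only if every checklist item passes."""
    return 1.0 if all(verify(plan, checklist).values()) else 0.0


# Toy instance with coupled precedence and horizon constraints.
DURATIONS = {"draft": 2, "review": 1, "publish": 1}
PRECEDENCES = [("draft", "review"), ("review", "publish")]
CHECKLIST = build_checklist(DURATIONS, PRECEDENCES, horizon=4)

GOOD_PLAN = {"draft": 0, "review": 2, "publish": 3}
BAD_PLAN = {"draft": 0, "review": 1, "publish": 3}  # review starts too early
```

Here `reward(GOOD_PLAN, CHECKLIST)` is 1.0 and `reward(BAD_PLAN, CHECKLIST)` is 0.0, while `verify` reports which items failed. A graded reward (for example, the fraction of items passed) would be an alternative design; the all-or-nothing version is chosen here only because it is the simplest to reason about.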