PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models
May 20, 2026
Authors: Ziliang Zhao, Zenan Xu, Shuting Wang, Hongjin Qian, Yan Lei, Minda Hu, Zhao Wang, Shihan Dou, Zhicheng Dou, Pluto Zhou
cs.AI
Abstract
Planning is a fundamental capability for large language models (LLMs) because complex tasks require models to coordinate goals, constraints, resources, and long-term consequences into executable and verifiable solutions. Existing planning benchmarks, however, usually treat planning data as fixed collections of instances rather than controllable generation targets. This limits scenario coverage, ties difficulty to surface-level proxies rather than structural sources, and offers limited support for scalable generation, automatic verification, or planning-oriented training. We introduce PlanningBench, a framework for generating scalable, diverse, and verifiable planning data for both evaluation and training. PlanningBench starts from real planning scenarios and abstracts practical workflows into a structured taxonomy of more than 30 task types, subtasks, constraint families, and difficulty factors. Guided by this taxonomy, a constraint-driven synthesis pipeline instantiates self-contained planning problems with adaptive difficulty control, quality filtering, and instance-level verification checklists. This shifts planning data construction from fixed benchmark collection to controllable generation while preserving realistic task grounding. We use PlanningBench to evaluate open-source and closed-source frontier LLMs and find that current models still struggle to produce complete solutions under coupled constraints. Beyond evaluation, reinforcement learning on verified PlanningBench data improves performance on unseen planning benchmarks and broader instruction-following tasks. Further analysis suggests that determinate or well-specified optimal solutions provide clearer reward signals and more stable training dynamics. Overall, PlanningBench provides a controllable source of planning data for diagnosing and improving generalizable planning abilities in LLMs.