PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models
May 20, 2026
Authors: Ziliang Zhao, Zenan Xu, Shuting Wang, Hongjin Qian, Yan Lei, Minda Hu, Zhao Wang, Shihan Dou, Zhicheng Dou, Pluto Zhou
cs.AI
Abstract
Planning is a fundamental capability for large language models (LLMs) because complex planning tasks require models to coordinate goals, constraints, resources, and long-term consequences into executable and verifiable solutions. Existing planning benchmarks, however, usually treat planning data as fixed collections of instances rather than controllable generation targets. This limits scenario coverage, ties difficulty to surface-level proxies rather than structural sources, and offers limited support for scalable generation, automatic verification, or planning-oriented training. We introduce PlanningBench, a framework for generating scalable, diverse, and verifiable planning data for both evaluation and training. PlanningBench starts from real planning scenarios and abstracts practical workflows into a structured taxonomy of more than 30 task types, subtasks, constraint families, and difficulty factors. Guided by this taxonomy, a constraint-driven synthesis pipeline instantiates self-contained planning problems with adaptive difficulty control, quality filtering, and instance-level verification checklists. This shifts planning data construction from fixed benchmark collection to controllable generation while preserving realistic task grounding. We use PlanningBench to evaluate open-source and closed-source frontier LLMs, and find that current models still struggle to produce complete solutions under coupled constraints. Beyond evaluation, reinforcement learning on verified PlanningBench data improves performance on unseen planning benchmarks and broader instruction-following tasks. Further analysis suggests that determinate or well-specified optimal solutions provide clearer reward signals and more stable training dynamics. Overall, PlanningBench provides a controllable source of planning data for diagnosing and improving generalizable planning abilities in LLMs.
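To make the idea of an instance-level verification checklist concrete, here is a minimal Python sketch. It is not the PlanningBench implementation: the abstract does not specify the checklist format, so the toy scheduling instance, the plan representation, and every name below (`Check`, `build_checklist`, `verify`) are assumptions made purely for illustration. The sketch derives checks from one instance's own constraints and scores a candidate plan by the fraction of checks it passes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical plan format (an assumption): task name -> (start_hour, end_hour).
Plan = Dict[str, Tuple[int, int]]


@dataclass
class Check:
    """One checklist item: a description plus a predicate over a plan."""
    description: str
    passes: Callable[[Plan], bool]


def no_overlap(plan: Plan) -> bool:
    """True if no two tasks occupy the same time on a single shared resource."""
    spans = sorted(plan.values())
    return all(prev[1] <= nxt[0] for prev, nxt in zip(spans, spans[1:]))


def build_checklist(durations: Dict[str, int], deadline: int,
                    before: List[Tuple[str, str]]) -> List[Check]:
    """Build checks from this instance's own constraints (illustrative only)."""
    checks = [
        Check("every task is scheduled", lambda p: set(p) == set(durations)),
        Check("durations are respected",
              lambda p: all(t in p and p[t][1] - p[t][0] == d
                            for t, d in durations.items())),
        Check("tasks do not overlap", no_overlap),
        Check(f"all tasks finish by hour {deadline}",
              lambda p: all(0 <= s and e <= deadline for s, e in p.values())),
    ]
    for a, b in before:
        # Bind a and b as defaults so each lambda keeps its own pair.
        checks.append(Check(f"{a} finishes before {b} starts",
                            lambda p, a=a, b=b: a in p and b in p
                            and p[a][1] <= p[b][0]))
    return checks


def verify(plan: Plan, checklist: List[Check]) -> Tuple[float, List[str]]:
    """Return the fraction of checks passed and the descriptions of failed ones."""
    failed = [c.description for c in checklist if not c.passes(plan)]
    return 1 - len(failed) / len(checklist), failed


checklist = build_checklist(
    durations={"draft": 2, "review": 1, "publish": 1},
    deadline=5,
    before=[("draft", "review"), ("review", "publish")],
)
good_plan = {"draft": (0, 2), "review": (2, 3), "publish": (3, 4)}
bad_plan = {"draft": (0, 2), "review": (1, 2), "publish": (4, 6)}

good_score, good_failed = verify(good_plan, checklist)
bad_score, bad_failed = verify(bad_plan, checklist)
```

In this toy instance, `good_plan` passes all six checks, while `bad_plan` fails four of them (wrong duration, overlap, missed deadline, and one ordering violation). A pass fraction like this could in principle serve as a reward signal during reinforcement learning, though how PlanningBench actually turns checklist results into rewards is not stated in the abstract.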