DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints

January 26, 2026
Authors: Yinger Zhang, Shutong Jiang, Renhao Li, Jianhong Tu, Yang Su, Lianghao Deng, Xudong Guo, Chenxu Lv, Junyang Lin
cs.AI

Abstract

While agent evaluation has shifted toward long-horizon tasks, most benchmarks still emphasize local, step-level reasoning rather than the global constrained optimization (e.g., time and financial budgets) that demands genuine planning ability. Meanwhile, existing LLM planning benchmarks underrepresent the active information gathering and fine-grained local constraints typical of real-world settings. To address this, we introduce DeepPlanning, a challenging benchmark for practical long-horizon agent planning. It features multi-day travel planning and multi-product shopping tasks that require proactive information acquisition, local constrained reasoning, and global constrained optimization. Evaluations on DeepPlanning show that even frontier agentic LLMs struggle with these problems, highlighting the importance of reliable explicit reasoning patterns and parallel tool use for achieving better effectiveness-efficiency trade-offs. Error analysis further points to promising directions for improving agentic LLMs over long planning horizons. We open-source the code and data to support future research.
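
To make the distinction between local and global constraints concrete, here is a minimal sketch of how a plan might be checked against both kinds. It is not the benchmark's actual checker: the `Activity` fields, the opening-hours rule, the overlap rule, and the budget figures are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Activity:
    """Hypothetical itinerary entry; times are hours on a 24h clock."""
    name: str
    day: int
    start: float
    end: float
    cost: float
    opens: float
    closes: float


def local_violations(plan):
    """Local check: each activity must fit inside its own opening hours."""
    return [a.name for a in plan if a.start < a.opens or a.end > a.closes]


def global_violations(plan, budget):
    """Plan-wide checks: total cost within budget, no same-day time overlaps."""
    issues = []
    total = sum(a.cost for a in plan)
    if total > budget:
        issues.append(f"total cost {total:.2f} exceeds budget {budget:.2f}")
    for a, b in combinations(plan, 2):
        if a.day == b.day and a.start < b.end and b.start < a.end:
            issues.append(f"{a.name} overlaps {b.name} on day {a.day}")
    return issues


# Illustrative data only (not taken from the benchmark).
plan = [
    Activity("museum", 1, 9.0, 12.0, 20.0, 10.0, 18.0),    # starts before opening
    Activity("lunch", 1, 11.5, 13.0, 30.0, 11.0, 15.0),    # overlaps the museum visit
    Activity("boat tour", 2, 14.0, 16.0, 80.0, 9.0, 17.0),
]

print(local_violations(plan))
print(global_violations(plan, budget=100.0))
```

The local check can be evaluated one activity at a time, whereas the plan-wide checks depend on every item chosen so far, which is why a single step-level decision cannot satisfy them in isolation.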