

DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints

January 26, 2026
Authors: Yinger Zhang, Shutong Jiang, Renhao Li, Jianhong Tu, Yang Su, Lianghao Deng, Xudong Guo, Chenxu Lv, Junyang Lin
cs.AI

Abstract

While agent evaluation has shifted toward long-horizon tasks, most benchmarks still emphasize local, step-level reasoning rather than the global constrained optimization (e.g., time and financial budgets) that demands genuine planning ability. Meanwhile, existing LLM planning benchmarks underrepresent the active information gathering and fine-grained local constraints typical of real-world settings. To address this, we introduce DeepPlanning, a challenging benchmark for practical long-horizon agent planning. It features multi-day travel planning and multi-product shopping tasks that require proactive information acquisition, local constrained reasoning, and global constrained optimization. Evaluations on DeepPlanning show that even frontier agentic LLMs struggle with these problems, highlighting the importance of reliable explicit reasoning patterns and parallel tool use for achieving better effectiveness-efficiency trade-offs. Error analysis further points to promising directions for improving agentic LLMs over long planning horizons. We open-source the code and data to support future research.