AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints
June 4, 2026
Authors: Jiayu Liu, Cheng Qian, Zhenhailong Wang, Bingxuan Li, Jiateng Liu, Heng Wang, Jeonghwan Kim, Yumeng Wang, Xiusi Chen, Yi R. Fung, Heng Ji
cs.AI
Abstract
Real-world planning by language models often involves both world constraints and user constraints, which may not be fully specified upfront and are instead disclosed progressively through interaction. However, existing benchmarks underexplore adaptive planning under such progressively revealed dual constraints. To address this gap, we introduce AdaPlanBench, a dynamic interactive benchmark for evaluating whether Large Language Model (LLM) agents can adaptively plan and re-plan under progressively revealed world and user constraints. AdaPlanBench is built on 307 household tasks, with a scalable constraint construction pipeline that augments each task with dual constraints. At runtime, agents interact with the environment in a multi-turn protocol in which a hidden constraint is revealed only when the agent proposes a plan that violates it, so the agent must revise its plan iteratively under accumulating feedback. This makes planning difficult, as agents must infer and track constraints from feedback while re-planning effectively. Experiments on ten leading LLMs show that adaptive planning under dual constraints remains challenging, with the best model reaching only 67.75% accuracy. We further observe that performance degrades as more constraints accumulate, that user constraints pose a particularly large challenge, and that failures often stem from weaker physical grounding and reduced effectiveness. These results establish AdaPlanBench as a testbed for dual-constrained interactive planning and highlight the challenge of reliable adaptation to dynamically revealed constraints in LLM agents.
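The multi-turn protocol described in the abstract can be illustrated with a toy sketch. This is not the AdaPlanBench implementation: the names `Constraint`, `run_episode`, and `toy_agent`, the string-matching violation check, the choice to reveal every violated constraint in a turn, and the turn budget are all assumptions made for illustration. The sketch also treats a plan as successful as soon as it violates no hidden constraint, and does not check whether the plan actually completes the task.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Constraint:
    """Hypothetical constraint: violated when the plan contains a forbidden step."""
    kind: str            # "world" or "user"
    description: str     # feedback shown to the agent once revealed
    forbidden_step: str  # toy violation rule (assumption, not from the paper)

    def violated_by(self, plan: List[str]) -> bool:
        return self.forbidden_step in plan


def run_episode(
    propose_plan: Callable[[List[Constraint]], List[str]],
    hidden: List[Constraint],
    max_turns: int = 5,  # assumed turn budget
) -> Tuple[bool, int, List[Constraint]]:
    """Run one interactive episode and return (solved, turns_used, revealed)."""
    revealed: List[Constraint] = []
    for turn in range(1, max_turns + 1):
        # The agent only sees constraints revealed so far.
        plan = propose_plan(revealed)
        violated = [c for c in hidden if c.violated_by(plan)]
        if not violated:
            return True, turn, revealed
        # A hidden constraint is disclosed only because the plan violated it.
        for c in violated:
            if c not in revealed:
                revealed.append(c)
    return False, max_turns, revealed


def toy_agent(revealed: List[Constraint]) -> List[str]:
    """Pick the first heating option not ruled out by revealed feedback."""
    banned = {c.forbidden_step for c in revealed}
    for option in ["use kettle", "use microwave", "use stove"]:
        if option not in banned:
            return ["fill cup", option]
    return ["fill cup"]


hidden = [
    Constraint("world", "The kettle is broken.", "use kettle"),
    Constraint("user", "The user does not want microwaved water.", "use microwave"),
]
solved, turns, revealed = run_episode(toy_agent, hidden)
print(solved, turns)  # True 3
```

In this toy run the agent trips over the world constraint on the first turn and the user constraint on the second, and only succeeds on the third, which mirrors how feedback accumulates across turns in the setting the abstract describes.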