AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints
June 4, 2026
Authors: Jiayu Liu, Cheng Qian, Zhenhailong Wang, Bingxuan Li, Jiateng Liu, Heng Wang, Jeonghwan Kim, Yumeng Wang, Xiusi Chen, Yi R. Fung, Heng Ji
cs.AI
Abstract
Planning for real-world problems by language models often involves both world and user constraints, which may not be fully specified upfront and are progressively disclosed through interaction. However, existing benchmarks still underexplore adaptive planning under such progressively revealed dual constraints. To address this gap, we introduce AdaPlanBench, a dynamic interactive benchmark for evaluating whether Large Language Model (LLM) agents can adaptively plan and re-plan under progressively revealed world and user constraints. AdaPlanBench is built on 307 household tasks, with a scalable constraint construction pipeline that augments each task with dual constraints. At runtime, agents interact with the environment in a multi-turn protocol where hidden constraints are revealed only when the agent proposes a plan that violates them, requiring iterative plan revision under accumulating feedback. This makes planning challenging, as agents must infer and track constraints from feedback while re-planning effectively. Experiments on ten leading LLMs show that adaptive planning under dual constraints remains challenging, with the best model reaching only 67.75% accuracy. We further observe that performance degrades as more constraints accumulate, with user constraints posing a particularly large challenge and failures often stemming from weaker physical grounding and reduced effectiveness. These results establish AdaPlanBench as a testbed for dual-constrained interactive planning and highlight the challenge of reliable adaptation to dynamically revealed constraints in LLM agents.
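The multi-turn protocol described in the abstract can be pictured with a small sketch. This is an illustration under stated assumptions, not the benchmark's actual implementation: the names `Constraint`, `run_episode`, and `toy_agent`, the turn budget, the rule that every violated constraint is revealed in the same turn, and the kettle task itself are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Plan = Sequence[str]


@dataclass(frozen=True)
class Constraint:
    """A hidden constraint (hypothetical structure, not from the paper)."""
    kind: str                             # "world" or "user"
    description: str                      # feedback text shown once revealed
    violated_by: Callable[[Plan], bool]   # True if the plan breaks the constraint


def run_episode(
    propose: Callable[[List[Constraint]], Plan],
    hidden: Sequence[Constraint],
    max_turns: int = 5,                   # assumed turn budget
) -> Tuple[bool, int, List[Constraint]]:
    """Reveal a constraint only when the proposed plan violates it."""
    revealed: List[Constraint] = []
    for turn in range(1, max_turns + 1):
        plan = propose(revealed)          # agent sees only accumulated feedback
        violated = [c for c in hidden if c.violated_by(plan)]
        if not violated:
            return True, turn, revealed   # plan satisfies every hidden constraint
        for c in violated:
            if c not in revealed:
                revealed.append(c)        # feedback accumulates across turns
    return False, max_turns, revealed


# Toy task: one world constraint and one user constraint (both invented).
hidden = [
    Constraint("world", "the stove is broken",
               lambda plan: any("stove" in step for step in plan)),
    Constraint("user", "the user does not want the microwave used",
               lambda plan: any("microwave" in step for step in plan)),
]


def toy_agent(revealed: List[Constraint]) -> Plan:
    """A scripted stand-in for an LLM agent that re-plans from feedback."""
    plan = ["fill container with water", "boil water on stove", "pour into mug"]
    known = {c.description for c in revealed}
    if "the stove is broken" in known:
        plan[1] = "boil water in microwave"
    if "the user does not want the microwave used" in known:
        plan[1] = "boil water with electric kettle"
    return plan


success, turns, revealed = run_episode(toy_agent, hidden)
print(success, turns, [c.kind for c in revealed])
```

In this toy run the scripted agent trips the world constraint first, then the user constraint, and succeeds only once both have been surfaced. This mirrors the abstract's point that agents must track accumulated feedback while re-planning.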