PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems

Abstract

LLM agents increasingly operate in large tool ecosystems, where real-world tasks require discovering relevant tools, inferring implicit sub-goals, and adapting to dynamic environments over long horizons. However, existing benchmarks rarely evaluate planning under retrieval-limited tool visibility. To address this gap, we introduce PlanBench-XL, an interactive benchmark of 327 retail tasks over 1,665 tools that tests whether agents can iteratively retrieve usable tools and invoke them to uncover intermediate evidence for subsequent calls toward the final goal. PlanBench-XL also features an optional blocking mechanism that simulates real-world unpredictability through missing, failing, or distracting tool functions, forcing agents to detect disrupted paths and adapt at runtime. Experiments on ten leading LLMs show that massive-tool planning remains challenging: GPT-5.4 achieves 51.90% accuracy in the block-free setting but collapses to 11.36% under the most severe blocking condition. Further analysis shows that agents are especially vulnerable when failures lack explicit error signals or when recovery requires longer alternative tool-use paths. These results establish PlanBench-XL as a testbed for diagnosing agentic planning failures and highlight the need for robust adaptive planning in long-horizon tasks with large, imperfect tool environments.
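To make the interaction pattern concrete, the following toy sketch illustrates retrieval-limited tool visibility with optional blocking. It is not the PlanBench-XL implementation: the `Tool`, `ToolEnv`, `make_env`, and `run_episode` names, the keyword-overlap retriever, the three example tools, and the price value are all hypothetical, and only the "missing" and "failing" block types are modeled (distracting tools are omitted).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Tool:
    name: str
    description: str
    fn: Callable[[Dict], Dict]  # reads the evidence gathered so far, returns new evidence


class ToolEnv:
    """Toy environment: the agent sees tools only through top-k retrieval."""

    def __init__(self, tools: List[Tool], blocked: Optional[Dict[str, str]] = None, top_k: int = 2):
        self.tools = {t.name: t for t in tools}
        self.blocked = blocked or {}  # hypothetical block types: "missing" or "failing"
        self.top_k = top_k

    def retrieve(self, query: str) -> List[str]:
        # Rank tools by word overlap between the query and the description.
        words = set(query.lower().split())
        scored = []
        for t in self.tools.values():
            if self.blocked.get(t.name) == "missing":
                continue  # a missing tool is never returned by retrieval
            overlap = len(words & set(t.description.lower().split()))
            if overlap:
                scored.append((-overlap, t.name))
        return [name for _, name in sorted(scored)[: self.top_k]]

    def call(self, name: str, state: Dict) -> Dict:
        if self.blocked.get(name) == "failing":
            return {"error": "tool unavailable"}  # explicit error signal
        return self.tools[name].fn(state)


def run_episode(env: ToolEnv, steps: List[str]) -> Optional[Dict]:
    """For each sub-goal, retrieve candidates and fall back when a call fails."""
    state: Dict = {}
    for query in steps:
        for name in env.retrieve(query):
            result = env.call(name, state)
            if "error" not in result:
                state.update(result)  # intermediate evidence feeds later calls
                break
        else:
            return None  # every visible candidate failed: the path is blocked
    return state


def price_of(state: Dict) -> Dict:
    # Depends on evidence (item_id) produced by an earlier tool call.
    if "item_id" not in state:
        return {"error": "missing item_id"}
    return {"price": 19.99}


def make_env(blocked: Optional[Dict[str, str]] = None) -> ToolEnv:
    tools = [
        Tool("order_lookup", "find item in order", lambda s: {"item_id": "sku-7"}),
        Tool("order_lookup_backup", "find item in order archive", lambda s: {"item_id": "sku-7"}),
        Tool("price_lookup", "get price of item", price_of),
    ]
    return ToolEnv(tools, blocked)


STEPS = ["find item in order", "get price of item"]
```

In this sketch, `run_episode(make_env({"order_lookup": "failing"}), STEPS)` recovers through the backup tool, whereas blocking both order tools leaves no usable path and the episode returns `None`. A failure without an explicit error signal, which the abstract identifies as especially hard, would correspond to a blocked tool returning plausible but wrong output instead of an `"error"` key.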