PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems

Abstract

LLM agents increasingly operate in large tool ecosystems, where real-world tasks require discovering relevant tools, inferring implicit sub-goals, and adapting to dynamic environments over long horizons. However, existing benchmarks rarely evaluate planning under retrieval-limited tool visibility. To address this gap, we introduce PlanBench-XL, an interactive benchmark of 327 retail tasks over 1,665 tools that tests whether agents can iteratively retrieve usable tools and invoke them to uncover the intermediate evidence needed for subsequent calls toward the final goal. PlanBench-XL further features an optional blocking mechanism that simulates real-world unpredictability through missing, failing, or distracting tool functions, forcing agents to detect disrupted paths and adapt at runtime. Experiments on ten leading LLMs show that massive-tool planning remains challenging: while GPT-5.4 achieves 51.90% accuracy in block-free settings, it collapses to 11.36% under the most severe blocking condition. Further analysis shows that agents are especially vulnerable when failures lack explicit error signals or when recovery requires longer alternative tool-use paths. These results establish PlanBench-XL as a testbed for diagnosing agentic planning failures and highlight the need for robust adaptive planning in long-horizon tasks with large, imperfect tool environments.
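
To make the setting concrete, the following is a minimal toy sketch of a retrieve-then-invoke loop with optional blocking. It is not the PlanBench-XL implementation or API: the tool names, the keyword-overlap retriever, the `block` helper, and the explicit list of sub-goals are all assumptions introduced for illustration (the benchmark itself requires agents to infer sub-goals, and its retriever is not described here).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Tool:
    name: str
    description: str
    fn: Callable[[dict], dict]  # reads evidence from state, returns new evidence


class ToolError(Exception):
    """Explicit failure signal raised by a blocked tool (hypothetical)."""


def retrieve(registry: Dict[str, Tool], query: str, k: int) -> List[Tool]:
    """Return at most k tools, ranked by word overlap with the query.

    Keyword overlap is a stand-in for a real retriever; the agent only
    ever sees these k tools, which models retrieval-limited visibility.
    """
    words = set(query.lower().split())
    scored = [
        (len(words & set(t.description.lower().split())), t.name)
        for t in registry.values()
    ]
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [registry[name] for score, name in ranked[:k] if score > 0]


def block(registry: Dict[str, Tool], name: str, mode: str) -> Dict[str, Tool]:
    """Return a copy of the registry with one tool missing or failing."""
    blocked = dict(registry)
    if mode == "missing":
        blocked.pop(name)
    elif mode == "failing":
        def fail(state: dict) -> dict:
            raise ToolError(f"{name} is unavailable")
        blocked[name] = Tool(name, registry[name].description, fail)
    else:
        raise ValueError(f"unknown blocking mode: {mode}")
    return blocked


def run_agent(
    registry: Dict[str, Tool], subgoals: List[str], k: int = 2
) -> Tuple[dict, List[str]]:
    """Greedy agent: per sub-goal, try retrieved tools until one succeeds."""
    state: dict = {}
    trace: List[str] = []
    for query in subgoals:
        for tool in retrieve(registry, query, k):
            trace.append(tool.name)
            try:
                state.update(tool.fn(state))
                break  # sub-goal satisfied, move on
            except ToolError:
                continue  # explicit error: fall back to the next candidate
    return state, trace


# Hypothetical three-tool retail registry (illustrative only).
REGISTRY = {
    t.name: t
    for t in [
        Tool("lookup_order", "look up customer order id",
             lambda s: {"order_id": "O-1"}),
        Tool("issue_refund", "issue refund for order",
             lambda s: {"refunded": s["order_id"]}),
        Tool("open_refund_ticket", "open support ticket to refund order",
             lambda s: {"refunded": s["order_id"]}),
    ]
}
SUBGOALS = ["look up order id", "issue refund for order"]

if __name__ == "__main__":
    for label, reg in [
        ("block-free", REGISTRY),
        ("failing", block(REGISTRY, "issue_refund", "failing")),
        ("missing", block(REGISTRY, "issue_refund", "missing")),
    ]:
        state, trace = run_agent(reg, SUBGOALS)
        print(label, trace, state.get("refunded"))
```

In this toy, the blocked runs still succeed because the fallback tool is within the retrieved top-2 and the failure raises an explicit error. Shrinking visibility to `k=1` with the failing tool leaves the agent with no alternative, which loosely mirrors the abstract's observation that recovery becomes harder when alternative paths are longer or failures are less visible.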