When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents

Abstract

Existing benchmarks evaluate Tool-Integrated Reasoning (TIR) in LLMs on idealized "happy paths", largely overlooking real-world tool failures. We introduce ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR agents. To separate systematic replanning from blind trial-and-error, ToolMaze adopts a two-dimensional design: DAG-based topological complexity and a 2×2 taxonomy of tool perturbations (explicit/implicit, transient/permanent). Evaluations show that perturbations degrade performance across nearly all models, with the sharpest drops under implicit semantic failures. Driven by systemic over-trust in corrupted outputs, the Perturbation Recovery Rate (PRR) plummets by around 37% in these scenarios, while complex topologies trap agents in futile trial-and-error loops. Crucially, agentic fault tolerance improves with model scale 3.66× more slowly than basic task execution, highlighting dynamic replanning as a distinct bottleneck unaddressed by model scaling or prompting. Data and code are available at https://github.com/Zhudongsheng75/ToolMaze.
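
The explicit/implicit and transient/permanent axes of the perturbation taxonomy can be illustrated with a small tool wrapper. The following is only a sketch under stated assumptions and is not taken from the ToolMaze code: it assumes that an explicit failure raises a visible error, an implicit failure silently returns a wrong value, a transient failure clears after a fixed number of calls, and a permanent failure never clears. The names PerturbedTool, ToolError, Visibility, Duration, transient_failures, and double are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Visibility(Enum):
    EXPLICIT = "explicit"  # assumption: the tool raises a visible error
    IMPLICIT = "implicit"  # assumption: the tool silently returns a wrong value


class Duration(Enum):
    TRANSIENT = "transient"  # assumption: fails a fixed number of calls, then recovers
    PERMANENT = "permanent"  # assumption: fails on every call


class ToolError(RuntimeError):
    """Hypothetical error type signalling an explicit tool failure."""


@dataclass
class PerturbedTool:
    """Wraps an integer tool and injects one cell of the 2x2 taxonomy."""

    tool: Callable[[int], int]
    visibility: Visibility
    duration: Duration
    transient_failures: int = 1  # hypothetical knob: failing calls before recovery
    calls: int = 0

    def _is_failing(self) -> bool:
        # Permanent perturbations never clear; transient ones clear
        # once the call count exceeds transient_failures.
        if self.duration is Duration.PERMANENT:
            return True
        return self.calls <= self.transient_failures

    def __call__(self, x: int) -> int:
        self.calls += 1
        if not self._is_failing():
            return self.tool(x)
        if self.visibility is Visibility.EXPLICIT:
            raise ToolError(f"tool failed on call {self.calls}")
        # Implicit failure: return a plausible but wrong value with no error.
        return self.tool(x) + 1


def double(x: int) -> int:
    """Toy stand-in for a real tool."""
    return 2 * x
```

Under these assumptions, an implicit transient perturbation returns a corrupted value on the first call and the correct value afterwards, so an agent that trusts the first output never notices the fault. An explicit permanent perturbation raises on every call, so retrying cannot help and the agent has to find another path.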