The Amazing Agent Race: Strong Tool Users, Weak Navigators
April 17, 2026
Authors: Zae Myung Kim, Dongseok Lee, Jaehyung Kim, Vipul Raheja, Dongyeop Kang
cs.AI
Abstract
Existing tool-use benchmarks for LLM agents are overwhelmingly linear: our analysis of six benchmarks shows that 55% to 100% of instances are simple chains of 2 to 5 steps. We introduce The Amazing Agent Race (AAR), a benchmark featuring directed acyclic graph (DAG) puzzles (or "legs") with fork-merge tool chains. We release 1,400 instances across two variants: sequential (800 legs) and compositional (600 DAG legs). Agents must navigate Wikipedia, execute multi-step tool chains, and aggregate results into a verifiable answer. Legs are procedurally generated from Wikipedia seeds across four difficulty levels with live-API validation. Three complementary metrics (finish-line accuracy, pit-stop visit rate, and roadblock completion rate) separately diagnose navigation, tool-use, and arithmetic failures. When three agent frameworks are evaluated on the 1,400 legs, the best achieves only 37.2% accuracy. Navigation errors dominate (27% to 52% of trials) while tool-use errors remain below 17%, and agent architecture matters as much as model scale (Claude Code matches Codex CLI at 37% with 6x fewer tokens). The compositional structure of AAR reveals that agents fail not at calling tools but at navigating to the right pages, a blind spot invisible to linear benchmarks. The project page can be accessed at: https://minnesotanlp.github.io/the-amazing-agent-race
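To make the distinction between a linear chain and a fork-merge leg concrete, here is a minimal Python sketch. It is an illustration only: the dictionary representation, the step names, and the helpers `is_linear_chain` and `execution_order` are hypothetical and are not taken from the AAR release. It assumes a leg can be described as a mapping from each step to the set of steps it depends on.

```python
from graphlib import TopologicalSorter


def is_linear_chain(deps):
    """Return True if no step forks or merges (hypothetical helper).

    deps maps each step name to the set of steps it depends on.
    """
    # Count how many later steps consume each step's output.
    consumers = {step: 0 for step in deps}
    for parents in deps.values():
        for parent in parents:
            consumers[parent] = consumers.get(parent, 0) + 1
    # A merge is a step with more than one parent;
    # a fork is a step whose output feeds more than one consumer.
    no_merge = all(len(parents) <= 1 for parents in deps.values())
    no_fork = all(count <= 1 for count in consumers.values())
    return no_merge and no_fork


def execution_order(deps):
    """Return one valid order in which the steps could be executed."""
    return list(TopologicalSorter(deps).static_order())


# Hypothetical sequential leg: each step depends only on the previous one.
sequential_leg = {
    "open_page": set(),
    "extract_value": {"open_page"},
    "call_tool": {"extract_value"},
    "answer": {"call_tool"},
}

# Hypothetical compositional leg: one page feeds two tool calls (fork),
# whose results are then combined into a single answer (merge).
compositional_leg = {
    "open_page": set(),
    "tool_a": {"open_page"},
    "tool_b": {"open_page"},
    "aggregate": {"tool_a", "tool_b"},
}

if __name__ == "__main__":
    print(is_linear_chain(sequential_leg))
    print(is_linear_chain(compositional_leg))
    print(execution_order(compositional_leg))
```

In this sketch the sequential leg has exactly one valid execution order, while the compositional leg allows its two branches to run in either order before the merge step, which is the structural property that a purely linear benchmark instance lacks.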