The Amazing Agent Race: Strong Tool Users, Weak Navigators

Abstract

Existing tool-use benchmarks for LLM agents are overwhelmingly linear: our analysis of six benchmarks shows that 55% to 100% of their instances are simple chains of 2 to 5 steps. We introduce The Amazing Agent Race (AAR), a benchmark of directed acyclic graph (DAG) puzzles (or "legs") built around fork-merge tool chains. We release 1,400 instances in two variants: sequential (800 legs) and compositional (600 DAG legs). Agents must navigate Wikipedia, execute multi-step tool chains, and aggregate the results into a verifiable answer. Legs are procedurally generated from Wikipedia seeds across four difficulty levels, with live-API validation. Three complementary metrics (finish-line accuracy, pit-stop visit rate, and roadblock completion rate) separately diagnose navigation, tool-use, and arithmetic failures. When three agent frameworks are evaluated on the 1,400 legs, the best achieves only 37.2% accuracy. Navigation errors dominate (27% to 52% of trials), while tool-use errors remain below 17%, and agent architecture matters as much as model scale (Claude Code matches Codex CLI at 37% with 6x fewer tokens). The compositional structure of AAR reveals that agents fail not at calling tools but at navigating to the right pages, a blind spot invisible to linear benchmarks. The project page can be accessed at: https://minnesotanlp.github.io/the-amazing-agent-race
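
To make the fork-merge structure and the three metrics more concrete, here is a minimal Python sketch. It is not the benchmark's implementation: the abstract does not define the metrics precisely, so the `Leg` and `Trial` classes, the step names (`page_a`, `tool_x`, and so on), and the metric formulas (exact-match answer, fraction of required pages visited, fraction of required tool steps completed) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass
class Leg:
    """Hypothetical representation of one puzzle ("leg")."""
    deps: dict        # step name -> set of prerequisite step names (the DAG)
    pit_stops: set    # steps assumed to be page visits
    roadblocks: set   # steps assumed to be tool calls
    answer: str       # the verifiable final answer


@dataclass
class Trial:
    """Hypothetical record of one agent attempt."""
    done: set         # steps the agent actually completed
    answer: str       # the answer the agent submitted


def execution_order(leg):
    # Any topological order is a valid way to walk the DAG.
    return list(TopologicalSorter(leg.deps).static_order())


def fraction(done, required):
    # Share of required steps that were completed (1.0 if none are required).
    return len(done & required) / len(required) if required else 1.0


def score(leg, trial):
    # Assumed definitions; the paper may compute these differently.
    return {
        "finish_line": float(trial.answer.strip() == leg.answer.strip()),
        "pit_stop_visit_rate": fraction(trial.done, leg.pit_stops),
        "roadblock_completion_rate": fraction(trial.done, leg.roadblocks),
    }


# A fork-merge leg: two pages, then two tool branches that merge into one answer.
leg = Leg(
    deps={
        "page_a": set(),
        "page_b": {"page_a"},
        "tool_x": {"page_b"},              # fork: branch 1
        "tool_y": {"page_b"},              # fork: branch 2
        "merge": {"tool_x", "tool_y"},     # merge: aggregate both branches
    },
    pit_stops={"page_a", "page_b"},
    roadblocks={"tool_x", "tool_y"},
    answer="42",
)

# The agent ran both tools but never reached page_b, so its answer is wrong.
trial = Trial(done={"page_a", "tool_x", "tool_y"}, answer="17")
result = score(leg, trial)
order = execution_order(leg)
```

In this toy trial the roadblock completion rate is 1.0 while the pit-stop visit rate is 0.5 and the finish-line score is 0.0, which mirrors the kind of "strong tool user, weak navigator" failure the abstract describes. Scoring the three quantities separately is what allows such a failure to be attributed to navigation rather than to tool use.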

