Structured Distillation of Web Agent Capabilities Enables Generalization

Abstract

Frontier LLMs can navigate complex websites, but their cost and reliance on third-party APIs make local deployment impractical. We introduce Agent-as-Annotators, a framework that structures synthetic trajectory generation for web agents by analogy to human annotation roles, replacing the Task Designer, Annotator, and Supervisor with modular LLM components. Using Gemini 3 Pro as the teacher, we generate 3,000 trajectories across six web environments and fine-tune a 9B-parameter student model with pure supervised learning on the 2,322 trajectories that pass quality filtering. The resulting model achieves 41.5% on WebArena, surpassing closed-source models such as Claude 3.5 Sonnet (36.0%) and GPT-4o (31.5%) under the same evaluation protocol, and nearly doubling the previous best open-weight result (Go-Browse, 21.7%). The capabilities transfer to unseen environments, with an 18.2 percentage point gain on WorkArena L1 (an enterprise platform never seen during training) and consistent improvements across three additional benchmarks. Ablations confirm that each pipeline component contributes meaningfully, with Judge filtering, evaluation hints, and reasoning traces each accounting for measurable gains. These results demonstrate that structured trajectory synthesis from a single frontier teacher is sufficient to produce competitive, locally deployable web agents. Project page: https://agent-as-annotators.github.io
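
The quality-filtering step and the reported trajectory counts can be illustrated with a minimal sketch. This is not the authors' implementation: the `Trajectory` record, its field names, and the `judge` callable are hypothetical stand-ins for whatever the Judge component actually does. Only the counts 3,000 and 2,322 come from the abstract.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Trajectory:
    """Hypothetical record for one teacher rollout (field names are assumptions)."""
    task: str
    steps: List[str]
    judge_passed: bool  # verdict a Judge-like component might assign


def filter_trajectories(
    trajectories: Iterable[Trajectory],
    judge: Callable[[Trajectory], bool],
) -> List[Trajectory]:
    """Keep only the trajectories that the judge accepts."""
    return [t for t in trajectories if judge(t)]


def retention_rate(kept: int, generated: int) -> float:
    """Fraction of generated trajectories that survive filtering."""
    if generated <= 0:
        raise ValueError("generated must be positive")
    return kept / generated


# Toy data, invented purely for illustration.
toy = [
    Trajectory("find order status", ["click", "type"], True),
    Trajectory("edit wiki page", ["click"], False),
]
kept = filter_trajectories(toy, judge=lambda t: t.judge_passed)

# Counts reported in the abstract: 3,000 generated, 2,322 kept.
rate = retention_rate(2322, 3000)
print(f"{len(kept)} of {len(toy)} toy trajectories kept")
print(f"reported retention: {rate:.1%}")
```

On the reported counts, about 77.4% of generated trajectories survive filtering, so roughly one in four teacher rollouts is discarded before fine-tuning.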
