Towards Internet-Scale Training For Agents

Abstract

The predominant approach for training web navigation agents gathers human demonstrations for a set of popular websites and hand-written tasks, but it is becoming clear that human data are an inefficient resource. We develop a pipeline to facilitate Internet-scale training for agents without laborious human annotations. In the first stage, an LLM generates tasks for 150k diverse websites. In the next stage, LLM agents complete the tasks and produce trajectories. In the final stage, an LLM reviews the trajectories and judges their success. Language models are competitive with human annotators: they detect and filter out harmful content with an accuracy of 97%, generate feasible tasks at an 89% rate, and judge successful trajectories with an accuracy of 82.6%. Scaling the pipeline, agents based on Llama 3.1 70B solve 16.7% of tasks for 150k sites. Training on the data generated by our pipeline is competitive with training on human demonstrations. In data-limited settings derived from Mind2Web and WebLINX, we improve Step Accuracy by up to +89.5% and +122.1%, respectively, for agents trained on mixtures of data from our pipeline and human data. When trained on all available human data from these benchmarks, agents fail to generalize to diverse real sites, and adding our data improves their generalization by +149.0% for WebLINX and +156.3% for Mind2Web. Code will be available at: data-for-agents.github.io.
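The three stages described in the abstract (task generation, agent rollout, LLM judging) can be sketched as a simple filter loop. This is a minimal illustration, not the authors' implementation: the names `Trajectory`, `run_pipeline`, `propose_task`, `run_agent`, and `judge` are hypothetical. The toy functions stand in for LLM calls. Folding the harmful-content filter into the task-proposal stage is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Trajectory:
    site: str
    task: str
    steps: List[str]


def run_pipeline(
    sites: Iterable[str],
    propose_task: Callable[[str], Optional[str]],
    run_agent: Callable[[str, str], List[str]],
    judge: Callable[[Trajectory], bool],
) -> List[Trajectory]:
    """Keep only the trajectories that the judge marks as successful."""
    kept: List[Trajectory] = []
    for site in sites:
        # Stage 1: an LLM proposes a task, or None if the site is filtered out
        # (assumption: harmful-content filtering happens at this stage).
        task = propose_task(site)
        if task is None:
            continue
        # Stage 2: an LLM agent attempts the task and records its steps.
        trajectory = Trajectory(site, task, run_agent(site, task))
        # Stage 3: an LLM judge decides whether the attempt succeeded.
        if judge(trajectory):
            kept.append(trajectory)
    return kept


# Toy stand-ins for the three LLM calls (hypothetical, for illustration only).
def toy_propose(site: str) -> Optional[str]:
    return None if "unsafe" in site else f"Find the contact page on {site}"


def toy_agent(site: str, task: str) -> List[str]:
    if site.endswith(".org"):
        return [f"goto {site}", "click 'Contact'"]
    return [f"goto {site}"]


def toy_judge(trajectory: Trajectory) -> bool:
    return len(trajectory.steps) > 1


data = run_pipeline(["a.org", "b.com", "unsafe.example"], toy_propose, toy_agent, toy_judge)
```

In this sketch only judged-successful trajectories are kept as training data, so the quality of the resulting dataset depends directly on the judge's accuracy.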
