AutoLibra: Agent Metric Induction from Open-Ended Feedback

Abstract

Agents are predominantly evaluated and optimized via task success metrics, which are coarse, rely on manual design by experts, and fail to reward intermediate emergent behaviors. We propose AutoLibra, a framework for agent evaluation that transforms open-ended human feedback, e.g., "If you find that the button is disabled, don't click it again" or "This agent has too much autonomy to decide what to do on its own", into metrics for evaluating fine-grained behaviors in agent trajectories. AutoLibra accomplishes this by grounding feedback in an agent's behavior, clustering similar positive and negative behaviors, and creating concrete metrics with clear definitions and examples, which can be used to prompt LLM-as-a-Judge evaluators. We further propose two meta-metrics, "coverage" and "redundancy", to evaluate how well a set of (induced) metrics aligns with open feedback. By optimizing these meta-metrics, we experimentally demonstrate AutoLibra's ability to induce more concrete agent evaluation metrics than those proposed in previous agent evaluation benchmarks and to discover new metrics for analyzing agents. We also present two applications of AutoLibra to agent improvement. First, we show that AutoLibra-induced metrics serve as better prompt-engineering targets than the task success rate on a wide range of text game tasks, improving agent performance over the baseline by a mean of 20%. Second, we show that AutoLibra can iteratively select high-quality fine-tuning data for web navigation agents. Our results suggest that AutoLibra is a powerful task-agnostic tool for evaluating and improving language agents.
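
The abstract names the two meta-metrics but does not define them. As a rough illustration only, the sketch below assumes one plausible reading: coverage is the fraction of human feedback aspects matched by at least one behavior flagged under the induced metrics, and redundancy is the fraction of flagged behaviors that match no feedback aspect. The function names, the identifiers, and the precomputed `matches` set are all hypothetical and are not taken from the paper.

```python
from typing import Hashable, Set, Tuple


def coverage(aspects: Set[Hashable], matches: Set[Tuple[Hashable, Hashable]]) -> float:
    """Fraction of feedback aspects matched by at least one flagged behavior (assumed definition)."""
    if not aspects:
        return 0.0
    matched_aspects = {aspect for aspect, _ in matches if aspect in aspects}
    return len(matched_aspects) / len(aspects)


def redundancy(traits: Set[Hashable], matches: Set[Tuple[Hashable, Hashable]]) -> float:
    """Fraction of flagged behaviors that match no feedback aspect (assumed definition)."""
    if not traits:
        return 0.0
    matched_traits = {trait for _, trait in matches if trait in traits}
    return (len(traits) - len(matched_traits)) / len(traits)


# Hypothetical example: four feedback aspects, three behaviors flagged by the metrics.
aspects = {"a1", "a2", "a3", "a4"}
traits = {"t1", "t2", "t3"}
# Each pair (aspect, trait) says the flagged behavior reflects that feedback aspect.
matches = {("a1", "t1"), ("a2", "t1"), ("a3", "t2")}

cov = coverage(aspects, matches)    # 3 of 4 aspects are matched
red = redundancy(traits, matches)   # 1 of 3 flagged behaviors is unmatched
```

Under this reading, a good metric set would push coverage up while keeping redundancy low; how the matching itself is obtained is left open here.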
