AutoLibra: Agent Metric Induction from Open-Ended Feedback
May 5, 2025
Authors: Hao Zhu, Phil Cuvin, Xinkai Yu, Charlotte Ka Yee Yan, Jason Zhang, Diyi Yang
cs.AI
Abstract
Agents are predominantly evaluated and optimized via task success metrics,
which are coarse, rely on manual design from experts, and fail to reward
intermediate emergent behaviors. We propose AutoLibra, a framework for agent
evaluation that transforms open-ended human feedback, e.g., "If you find that
the button is disabled, don't click it again", or "This agent has too much
autonomy to decide what to do on its own", into metrics for evaluating
fine-grained behaviors in agent trajectories. AutoLibra accomplishes this by
grounding feedback to an agent's behavior, clustering similar positive and
negative behaviors, and creating concrete metrics with clear definitions and
concrete examples, which can be used to prompt LLM-as-a-Judge evaluators. We
further propose two meta-metrics to evaluate the alignment of a
set of (induced) metrics with open feedback: "coverage" and "redundancy".
By optimizing these meta-metrics, we experimentally demonstrate AutoLibra's
ability to induce agent evaluation metrics more concrete than those proposed
in previous agent evaluation benchmarks, and to discover new metrics for
analyzing agents. We also present two applications of AutoLibra in agent
improvement: First, we show that AutoLibra-induced metrics serve as better
prompt-engineering targets than the task success rate on a wide range of text
game tasks, improving agent performance over baseline by a mean of 20%. Second,
we show that AutoLibra can iteratively select high-quality fine-tuning data for
web navigation agents. Our results suggest that AutoLibra is a powerful
task-agnostic tool for evaluating and improving language agents.Summary
AI-Generated Summary