

Towards a Science of AI Agent Reliability

February 18, 2026
Authors: Stephan Rabanser, Sayash Kapoor, Peter Kirgis, Kangheng Liu, Saiteja Utpala, Arvind Narayanan
cs.AI

Abstract

AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents continue to fail in practice. This discrepancy highlights a fundamental limitation of current evaluations: compressing agent behavior into a single success metric obscures critical operational flaws. Notably, it ignores whether agents behave consistently across runs, withstand perturbations, fail predictably, or have bounded error severity. Grounded in safety-critical engineering, we provide a holistic performance profile by proposing twelve concrete metrics that decompose agent reliability along four key dimensions: consistency, robustness, predictability, and safety. Evaluating 14 agentic models across two complementary benchmarks, we find that recent capability gains have yielded only small improvements in reliability. By exposing these persistent limitations, our metrics complement traditional evaluations while offering tools for reasoning about how agents perform, degrade, and fail.
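To make the consistency dimension concrete, here is a minimal sketch of the kind of run-to-run metric the abstract alludes to. This is an illustrative construction, not the paper's exact definitions: it compares average per-run accuracy against the fraction of tasks an agent solves in *every* run, so a gap between the two numbers signals inconsistency that a single accuracy score would hide.

```python
from statistics import mean

def consistency_profile(outcomes):
    """Illustrative consistency metric (not the paper's exact definition).

    outcomes: dict mapping task id -> list of booleans (success per run).
    Returns (average per-run accuracy, fraction of tasks solved in all runs).
    """
    # Average accuracy pooled over runs -- the usual benchmark number.
    per_run_acc = mean(mean(runs) for runs in outcomes.values())
    # Stricter "solved in every run" rate -- penalizes flaky behavior.
    all_runs_acc = mean(all(runs) for runs in outcomes.values())
    return per_run_acc, all_runs_acc

# Hypothetical example: 2 tasks, 4 independent runs each.
acc, consistent = consistency_profile({
    "t1": [True, True, False, True],   # flaky: fails 1 of 4 runs
    "t2": [True, True, True, True],    # stable
})
```

Here the pooled accuracy is 0.875 while the all-runs rate is only 0.5, illustrating how a single success metric can mask unreliable behavior across repeated runs.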