
Towards a Science of AI Agent Reliability

February 18, 2026
作者: Stephan Rabanser, Sayash Kapoor, Peter Kirgis, Kangheng Liu, Saiteja Utpala, Arvind Narayanan
cs.AI

Abstract

AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents continue to fail in practice. This discrepancy highlights a fundamental limitation of current evaluations: compressing agent behavior into a single success metric obscures critical operational flaws. Notably, it ignores whether agents behave consistently across runs, withstand perturbations, fail predictably, or have bounded error severity. Grounded in safety-critical engineering, we provide a holistic performance profile by proposing twelve concrete metrics that decompose agent reliability along four key dimensions: consistency, robustness, predictability, and safety. Evaluating 14 agentic models across two complementary benchmarks, we find that recent capability gains have yielded only small improvements in reliability. By exposing these persistent limitations, our metrics complement traditional evaluations while offering tools for reasoning about how agents perform, degrade, and fail.
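The abstract's notion of consistency across runs can be illustrated with a minimal sketch. The paper's actual metric definitions are not given here; the function name, data shape, and the "solved-in-every-run vs. solved-in-at-least-one-run" gap below are illustrative assumptions, not the authors' method.

```python
# Hypothetical sketch (not the paper's actual metric): one natural way to
# quantify run-to-run consistency is to compare the fraction of tasks an
# agent solves in every run against the fraction it solves in at least one.

def consistency_profile(results):
    """results: dict mapping task_id -> list of per-run success booleans."""
    n = len(results)
    any_pass = sum(any(runs) for runs in results.values()) / n
    all_pass = sum(all(runs) for runs in results.values()) / n
    # A large gap between the two signals flaky, inconsistent behavior.
    return {"any_pass": any_pass, "all_pass": all_pass, "gap": any_pass - all_pass}

# Example: 3 tasks, 3 runs each.
profile = consistency_profile({
    "t1": [True, True, True],    # always solved
    "t2": [True, False, True],   # flaky across runs
    "t3": [False, False, False], # never solved
})
```

A single-number benchmark that counts `t2` as a success (best-of-k) would report 2/3 accuracy, while the all-runs view reports 1/3; the gap between them is exactly the operational inconsistency such evaluations hide.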