Towards a Science of AI Agent Reliability

Abstract

AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents continue to fail in practice. This discrepancy highlights a fundamental limitation of current evaluations: compressing agent behavior into a single success metric obscures critical operational flaws. Notably, it ignores whether agents behave consistently across runs, withstand perturbations, fail predictably, or have bounded error severity. Grounded in safety-critical engineering, we provide a holistic performance profile by proposing twelve concrete metrics that decompose agent reliability along four key dimensions: consistency, robustness, predictability, and safety. Evaluating 14 agentic models across two complementary benchmarks, we find that recent capability gains have yielded only small improvements in reliability. By exposing these persistent limitations, our metrics complement traditional evaluations while offering tools for reasoning about how agents perform, degrade, and fail.
