Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

Abstract

Large language models are increasingly deployed as autonomous agents that execute multi-step workflows in real-world software environments. However, existing agent benchmarks suffer from three critical limitations: (1) trajectory-opaque grading that checks only final outputs, (2) underspecified safety and robustness evaluation, and (3) narrow coverage of modalities and interaction paradigms. We introduce Claw-Eval, an end-to-end evaluation suite that addresses all three gaps. It comprises 300 human-verified tasks spanning 9 categories across three groups (general service orchestration, multimodal perception and generation, and multi-turn professional dialogue). Every agent action is recorded through three independent evidence channels (execution traces, audit logs, and environment snapshots), enabling trajectory-aware grading over 2,159 fine-grained rubric items. The scoring protocol evaluates Completion, Safety, and Robustness, and reports Average Score, Pass@k, and Pass^k across three trials to distinguish genuine capability from lucky outcomes. Experiments on 14 frontier models reveal that: (1) trajectory-opaque evaluation is systematically unreliable, missing 44% of the safety violations and 13% of the robustness failures that our hybrid pipeline catches; (2) controlled error injection primarily degrades consistency rather than peak capability, with Pass^3 dropping by up to 24% while Pass@3 remains stable; (3) multimodal performance varies sharply, with most models performing worse on video than on documents or images, and no single model dominating across all modalities. Beyond benchmarking, Claw-Eval highlights actionable directions for agent development, shedding light on what it takes to build agents that are not only capable but reliably deployable.
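To make the three reported metrics concrete, here is a minimal sketch of how Average Score, Pass@k, and Pass^k could be computed from per-trial scores. The abstract does not give the exact definitions or the pass criterion, so this assumes the common convention that a task counts toward Pass@k if at least one of its k trials passes, and toward Pass^k only if all k trials pass. The function name summarize_trials, the pass_threshold parameter, and the sample scores are hypothetical and are not taken from Claw-Eval.

```python
from statistics import mean


def summarize_trials(trial_scores, pass_threshold=0.5):
    """Aggregate per-task trial scores into three summary metrics.

    trial_scores: dict mapping a task id to a list of k scores in [0, 1],
                  one score per independent trial.
    pass_threshold: hypothetical cutoff; a trial "passes" if its score
                    is at least this value (an assumption, not the
                    benchmark's documented criterion).
    """
    per_task = list(trial_scores.values())

    # Average Score: mean over tasks of the mean trial score.
    average_score = mean(mean(scores) for scores in per_task)

    # Pass@k: fraction of tasks where at least one trial passes.
    pass_at_k = mean(
        1.0 if any(s >= pass_threshold for s in scores) else 0.0
        for scores in per_task
    )

    # Pass^k: fraction of tasks where every trial passes.
    pass_hat_k = mean(
        1.0 if all(s >= pass_threshold for s in scores) else 0.0
        for scores in per_task
    )

    return {
        "average_score": average_score,
        "pass_at_k": pass_at_k,
        "pass_hat_k": pass_hat_k,
    }


# Made-up scores for four tasks, three trials each.
example = {
    "task_a": [1.0, 1.0, 1.0],  # consistent success
    "task_b": [1.0, 0.0, 1.0],  # succeeds only sometimes
    "task_c": [0.0, 0.0, 0.0],  # consistent failure
    "task_d": [1.0, 1.0, 0.0],  # succeeds only sometimes
}
print(summarize_trials(example))
```

With these sample scores, three of the four tasks pass at least once (Pass@3 = 0.75) but only one passes every time (Pass^3 = 0.25). This gap between peak capability and consistency is what reporting both metrics is meant to expose.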
