Beyond Static Leaderboards: Predictive Validity for LLM Agent Evaluation

Abstract

Agent benchmarks are growing fast, but no single benchmark covers more than four or five of the dimensions that real deployment exposes. This paper aggregates the largest coordinated deep dive to date into one MCP-based industrial-agent benchmark: fourteen parallel implementation studies covering new asset classes (including a multi-modal visual extension), alternative orchestrations, retrieval strategies, reasoning modes, infrastructure optimizations, and evaluation-methodology probes. Consolidating those studies with seven prior agent benchmarks, we argue that aggregate-score leaderboards systematically underspecify deployed-agent evaluation. Rankings derived from aggregate scores do not transfer to out-of-distribution settings; recent public-to-hidden competition retrospectives provide direct empirical evidence of this rank instability. We propose ranking configurations by predictive validity, defined as the correlation between in-sample and out-of-sample rank, rather than by in-sample mean, and we report a twelve-tier measurement apparatus that exposes the deployment-relevant dimensions that HELM and its agent-era successors collapse. The position is operationalized through three falsifiable out-of-distribution criteria with explicit thresholds; existing evidence partly supports it but is too thin to confirm it. We close with a pre-registered pilot design and a field-level vision for what the next generation of agentic benchmarks should report.
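As an illustration of the predictive-validity quantity described above, the following minimal sketch computes a rank correlation between in-sample and out-of-sample scores for a set of agent configurations. The abstract does not name the correlation coefficient, so the use of Spearman's rho here is an assumption, and the function names (`ranks`, `pearson`, `predictive_validity`) are hypothetical rather than taken from the paper.

```python
from typing import Dict, List


def ranks(scores: List[float]) -> List[float]:
    """Return 1-based ranks (1 = highest score); tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next score ties with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all ranks are tied")
    return cov / (var_x * var_y) ** 0.5


def predictive_validity(in_sample: Dict[str, float],
                        out_of_sample: Dict[str, float]) -> float:
    """Spearman's rho (an assumed choice) between in-sample and out-of-sample ranks.

    Both arguments map a configuration name to its aggregate score.
    """
    if set(in_sample) != set(out_of_sample):
        raise ValueError("both score tables must cover the same configurations")
    names = sorted(in_sample)
    if len(names) < 2:
        raise ValueError("at least two configurations are required")
    return pearson(ranks([in_sample[n] for n in names]),
                   ranks([out_of_sample[n] for n in names]))
```

With made-up scores such as `{"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6}` in-sample and `{"a": 0.5, "b": 0.7, "c": 0.6, "d": 0.4}` out-of-sample, `predictive_validity` returns 0.4: the in-sample leader drops to third place out of sample, so the leaderboard order only weakly predicts the out-of-distribution order.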