Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting

Abstract

AI evaluation results are produced at scale but reported inconsistently across leaderboards, model cards, benchmark papers, and company blogs. The cost is interpretive: readers cannot reliably compare results across sources, identify what a report omits, or trace an aggregate claim to its underlying evidence. Recent efforts address isolated components but leave three gaps: they cover only narrow slices of the evaluation lifecycle and do not compose into a single interpretable record; they specify static representations that do not differentiate the questions different stakeholders bring to the same evidence; and they remain proposals on paper, lacking the extraction infrastructure required for adoption at scale. We present ___, an operational reporting layer that composes benchmark metadata, evaluation run data, and model metadata into a unified record. We (1) derive a reporting schema from a structured review of 52 papers and 10 stakeholder interviews, (2) implement four interpretive signals (reproducibility, documentation completeness, provenance and risk, and score comparability), rendered through reader modes calibrated to research and non-research audiences, and (3) deploy a monitoring tool that applies ___ across 5,816 models, 635 benchmarks, and 101,843 results, surfacing systematic gaps in current reporting practice.