Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting

June 8, 2026
Authors: Avijit Ghosh, Anka Reuel, Jenny Chim, Wm. Matthew Kennedy, Srishti Yadav, Jennifer Mickel, Yanan Long, Andrew Tran, Anastassia Kornilova, Damian Stachura, Kevin Klyman, Felix Friedrich, Jeba Sania, Max Lamparth, Jan Batzner, Anoop Mishra, Eliya Habba, Yixiong Hao, Nathan Heath, Shalaleh Rismani, Usman Gohar, Andrea Loehr, David Manheim, Ruchira Dhar, Sree Harsha Nelaturu, Aarush Sinha, Leshem Choshen, Drishti Sharma, Ishan Khire, Amit Saha, Subramanyam Sahoo, Michael Hardy, Michael Alexander Riegler, Kabir Manghnani, Michelle Lin, Yanan Jiang, Yilin Huang, Asaf Yehudai, Jessica Ji, Aris Hofmann, Mubashara Akhtar, Nuno Moniz, Yacine Jernite, Stella Biderman, Zeerak Talat, Sanmi Koyejo, Mykel Kochenderfer, Irene Solaiman
cs.AI

Abstract

AI evaluation results are produced at scale but reported inconsistently across leaderboards, model cards, benchmark papers, and company blogs. The cost is interpretive: readers cannot reliably compare results across sources, identify what a report omits, or trace an aggregate claim to its underlying evidence. Recent efforts address isolated components but leave three gaps: they cover only narrow slices of the evaluation lifecycle and do not compose into a single interpretable record; they specify static representations that do not differentiate the questions different stakeholders bring to the same evidence; and they remain proposals on paper, lacking the extraction infrastructure required for adoption at scale. We present Evaluation Cards, an operational reporting layer that composes benchmark metadata, evaluation run data, and model metadata into a unified record. We (1) derive a reporting schema from a structured review of 52 papers and 10 stakeholder interviews, (2) implement four interpretive signals (reproducibility, documentation completeness, provenance and risk, and score comparability), rendered through reader modes calibrated to research and non-research audiences, and (3) deploy a monitoring tool that applies it across 5,816 models, 635 benchmarks, and 101,843 results, surfacing systematic gaps in current reporting practice.
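To make the idea of a unified record concrete, here is a minimal sketch of how benchmark metadata, evaluation run data, and model metadata might be composed into one record, with a simple documentation-completeness signal computed over it. Every class name, field name, and the scoring rule (fraction of non-empty fields) is a hypothetical illustration chosen for this example; the abstract does not specify the actual schema or how the signals are calculated.

```python
from dataclasses import dataclass, fields
from typing import Optional


# All field names below are hypothetical; the real schema is not given in the abstract.
@dataclass
class BenchmarkMetadata:
    name: str
    task: Optional[str] = None
    license: Optional[str] = None


@dataclass
class RunData:
    score: float
    metric: Optional[str] = None
    prompt_template: Optional[str] = None
    seed: Optional[int] = None


@dataclass
class ModelMetadata:
    name: str
    developer: Optional[str] = None
    parameter_count: Optional[int] = None


@dataclass
class EvaluationRecord:
    """One unified record composed from the three metadata sources."""
    benchmark: BenchmarkMetadata
    run: RunData
    model: ModelMetadata


def documentation_completeness(record: EvaluationRecord) -> float:
    """Assumed scoring rule: the fraction of fields that are not None."""
    parts = (record.benchmark, record.run, record.model)
    values = [getattr(part, f.name) for part in parts for f in fields(part)]
    filled = sum(value is not None for value in values)
    return filled / len(values)


# Placeholder values for illustration only, not real results.
record = EvaluationRecord(
    benchmark=BenchmarkMetadata(name="example-benchmark", task="qa"),
    run=RunData(score=0.5, metric="accuracy"),
    model=ModelMetadata(name="example-model"),
)
completeness = documentation_completeness(record)
```

In this sketch five of the ten fields are filled, so the signal evaluates to 0.5. A deployed system would presumably weight fields differently and derive the other three signals from richer provenance information.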