Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting

June 8, 2026
Authors: Avijit Ghosh, Anka Reuel, Jenny Chim, Wm. Matthew Kennedy, Srishti Yadav, Jennifer Mickel, Yanan Long, Andrew Tran, Anastassia Kornilova, Damian Stachura, Kevin Klyman, Felix Friedrich, Jeba Sania, Max Lamparth, Jan Batzner, Anoop Mishra, Eliya Habba, Yixiong Hao, Nathan Heath, Shalaleh Rismani, Usman Gohar, Andrea Loehr, David Manheim, Ruchira Dhar, Sree Harsha Nelaturu, Aarush Sinha, Leshem Choshen, Drishti Sharma, Ishan Khire, Amit Saha, Subramanyam Sahoo, Michael Hardy, Michael Alexander Riegler, Kabir Manghnani, Michelle Lin, Yanan Jiang, Yilin Huang, Asaf Yehudai, Jessica Ji, Aris Hofmann, Mubashara Akhtar, Nuno Moniz, Yacine Jernite, Stella Biderman, Zeerak Talat, Sanmi Koyejo, Mykel Kochenderfer, Irene Solaiman
cs.AI

Abstract

AI evaluation results are produced at scale but reported inconsistently across leaderboards, model cards, benchmark papers, and company blogs. The cost is interpretive: readers cannot reliably compare results across sources, identify what a report omits, or trace an aggregate claim to its underlying evidence. Recent efforts address isolated components but leave three gaps: they cover only narrow slices of the evaluation lifecycle and do not compose into a single interpretable record; they specify static representations that do not differentiate the questions different stakeholders bring to the same evidence; and they remain proposals on paper, lacking the extraction infrastructure required for adoption at scale. We present Evaluation Cards, an operational reporting layer that composes benchmark metadata, evaluation run data, and model metadata into a unified record. We (1) derive a reporting schema from a structured review of 52 papers and 10 stakeholder interviews, (2) implement four interpretive signals (reproducibility, documentation completeness, provenance and risk, and score comparability), rendered through reader modes calibrated to research and non-research audiences, and (3) deploy a monitoring tool that applies this layer across 5,816 models, 635 benchmarks, and 101,843 results, surfacing systematic gaps in current reporting practice.
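
To make the idea of a unified record concrete, the following is a minimal sketch of how benchmark metadata, evaluation run data, and model metadata might be composed, with a toy documentation-completeness signal computed over the result. Every class name, field name, and the scoring rule here is a hypothetical illustration and is not taken from the paper's schema.

```python
from dataclasses import dataclass, fields
from typing import Optional


# All classes and fields below are hypothetical, chosen only for illustration.
@dataclass
class BenchmarkMetadata:
    name: str
    metric: Optional[str] = None
    license: Optional[str] = None


@dataclass
class EvaluationRun:
    score: float
    prompt_template: Optional[str] = None
    seed: Optional[int] = None


@dataclass
class ModelMetadata:
    name: str
    developer: Optional[str] = None


@dataclass
class EvaluationRecord:
    """One unified record composed from the three metadata sources."""
    benchmark: BenchmarkMetadata
    run: EvaluationRun
    model: ModelMetadata


def documentation_completeness(record: EvaluationRecord) -> float:
    """Toy signal: fraction of fields across the three parts that are filled in."""
    parts = (record.benchmark, record.run, record.model)
    values = [getattr(part, f.name) for part in parts for f in fields(part)]
    return sum(value is not None for value in values) / len(values)


record = EvaluationRecord(
    benchmark=BenchmarkMetadata(name="example-bench", metric="accuracy"),
    run=EvaluationRun(score=0.5, seed=0),
    model=ModelMetadata(name="example-model"),
)
completeness = documentation_completeness(record)
```

In this sketch, 5 of the 8 fields are populated, so the signal evaluates to 0.625. A real signal would presumably weight fields by importance rather than counting them equally; that refinement is left out here.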