Toward Evaluation Engineering: An Empirical Study of Real-World Machine Learning Evaluation Frameworks

Abstract

Evaluation harnesses are software systems that orchestrate model evaluation by managing model invocation, data loading, metric computation, and result reporting. Despite their critical role in machine learning infrastructure, their operational challenges and engineering concerns have received limited attention. We present an empirical study of 57 evaluation harnesses, from which we derive a five-stage harness model and classify 16,560 issues by workflow stage and root cause. Operational challenges concentrate most heavily in the Specification stage (41.4% of issues), where harnesses integrate external models, datasets, and scoring judges. The three most frequent root causes are unimplemented features (24.3%), documentation gaps (20.3%), and missing input validation (17.2%), which together account for 61.7% of classified issues and span both defects in existing functionality and capability gaps that block intended workflows. Root causes also vary by workflow stage: environment incompatibility and external dependency breakage account for 36.2% of Provisioning issues, whereas algorithmic errors (25.9%) and validation gaps (22.5%) dominate Assessment issues. Together, these findings establish an empirical foundation for treating evaluation engineering as a distinct software engineering concern.