Toward Evaluation Engineering: An Empirical Study of Machine Learning Evaluation Harnesses in the Wild

Abstract

Evaluation harnesses are software systems that orchestrate model evaluation by managing model invocation, data loading, metric computation, and result reporting. Despite their critical role in machine learning infrastructure, their operational challenges and engineering concerns have so far received limited attention. We present an empirical study of 57 evaluation harnesses, from which we derive a five-stage harness model and classify 16,560 issues by workflow stage and root cause. Operational challenges concentrate most heavily in the Specification stage (41.4% of issues), where harnesses integrate external models, datasets, and scoring judges. The three most frequent root causes are unimplemented features (24.3%), documentation gaps (20.3%), and missing input validation (17.2%). Combined, they account for 61.7% of classified issues and span both defects in existing functionality and capability gaps that block intended workflows. Root causes also vary by workflow stage: environment incompatibility and external dependency breakage account for 36.2% of provisioning issues, whereas algorithmic errors (25.9%) and validation gaps (22.5%) dominate assessment issues. Together, these contributions establish an empirical foundation for treating evaluation engineering as a distinct software engineering concern.