Toward Evaluation Engineering: An Empirical Study of ML Evaluation Harnesses in the Wild

Abstract

Evaluation harnesses are software systems that orchestrate model evaluation by managing model invocation, data loading, metric computation, and result reporting. Despite their critical role in machine learning infrastructure, their operational challenges and engineering concerns have received little attention. We present an empirical study of 57 evaluation harnesses, from which we derive a five-stage harness model and classify 16,560 issues by workflow stage and root cause. Operational challenges concentrate most heavily in the Specification stage (41.4% of issues), where harnesses integrate external models, datasets, and scoring judges. The three most frequent root causes are unimplemented features (24.3%), documentation gaps (20.3%), and missing input validation (17.2%), which together account for 61.7% of classified issues and span both defects in existing functionality and capability gaps that block intended workflows. Root causes also vary by workflow stage: environment incompatibility and external dependency breakage account for 36.2% of Provisioning issues, whereas algorithmic errors (25.9%) and validation gaps (22.5%) dominate Assessment issues. Together, these findings establish an empirical foundation for treating evaluation engineering as a distinct software engineering concern.