Towards Evaluation Engineering: An Empirical Study of ML Evaluation Harnesses in the Wild

Abstract

Evaluation harnesses are software systems that orchestrate model evaluation by managing model invocation, data loading, metric computation, and result reporting. Despite their critical role in machine learning infrastructure, their operational challenges and engineering concerns have received little attention. We present an empirical study of 57 evaluation harnesses, from which we derive a five-stage harness model and classify 16,560 issues by workflow stage and root cause. Operational challenges concentrate in the Specification stage (41.4% of issues), where harnesses integrate external models, datasets, and scoring judges. The three most frequent root causes are unimplemented features (24.3%), documentation gaps (20.3%), and missing input validation (17.2%). Together they account for 61.7% of classified issues and span both defects in existing functionality and capability gaps that block intended workflows. Root causes also vary by workflow stage: environment incompatibility and external dependency breakage account for 36.2% of Provisioning issues, whereas algorithmic errors (25.9%) and validation gaps (22.5%) dominate Assessment issues. These contributions establish an empirical foundation for treating evaluation engineering as a distinct software engineering concern.