ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation

Abstract

Evaluating generative AI models is increasingly resource-intensive due to slow inference, expensive raters, and a rapidly growing landscape of models and benchmarks. We propose ProEval, a proactive evaluation framework that leverages transfer learning to efficiently estimate performance and identify failure cases. ProEval employs pre-trained Gaussian Processes (GPs) as surrogates for the performance score function, mapping model inputs to metrics such as the severity of errors or safety violations. By framing performance estimation as Bayesian quadrature (BQ) and failure discovery as superlevel set sampling, we develop uncertainty-aware decision strategies that actively select or synthesize highly informative inputs for testing. Theoretically, we prove that our pre-trained GP-based BQ estimator is unbiased and bounded. Empirically, extensive experiments on reasoning, safety alignment, and classification benchmarks demonstrate that ProEval is significantly more efficient than competitive baselines. It requires 8-65x fewer samples to achieve estimates within 1% of the ground truth, while simultaneously revealing more diverse failure cases under a stricter evaluation budget.
