ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation

Abstract

Evaluating generative AI models is increasingly resource-intensive due to slow inference, expensive raters, and a rapidly growing landscape of models and benchmarks. We propose ProEval, a proactive evaluation framework that leverages transfer learning to estimate performance efficiently and to identify failure cases. ProEval employs pre-trained Gaussian processes (GPs) as surrogates for the performance score function, which maps model inputs to metrics such as the severity of errors or safety violations. By framing performance estimation as Bayesian quadrature (BQ) and failure discovery as superlevel set sampling, we develop uncertainty-aware decision strategies that actively select or synthesize highly informative inputs for testing. Theoretically, we prove that our pre-trained GP-based BQ estimator is unbiased and bounded. Empirically, extensive experiments on reasoning, safety alignment, and classification benchmarks demonstrate that ProEval is significantly more efficient than competitive baselines: it requires 8-65x fewer samples to achieve estimates within 1% of the ground truth, while simultaneously revealing more diverse failure cases under a stricter evaluation budget.
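To make the two formulations concrete, the following is a minimal sketch of GP-based Bayesian quadrature and superlevel set scoring over a finite pool of candidate inputs. It is not the authors' implementation. The RBF kernel, its hand-set hyperparameters, the max-variance query rule, the one-dimensional input pool, and the toy score function are all assumptions chosen for illustration; in ProEval the GP prior would be pre-trained on related tasks rather than fixed by hand.

```python
import math
import numpy as np


def rbf_kernel(a, b, lengthscale=0.2, variance=1.0):
    """Squared-exponential kernel between row-wise point sets a and b (assumed kernel)."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)


def gp_posterior(pool, idx, y, prior_mean, noise=1e-6):
    """Posterior mean and covariance over the whole pool, given scores y at pool[idx]."""
    k_oo = rbf_kernel(pool[idx], pool[idx]) + noise * np.eye(len(idx))
    k_po = rbf_kernel(pool, pool[idx])
    mean = prior_mean + k_po @ np.linalg.solve(k_oo, y - prior_mean)
    cov = rbf_kernel(pool, pool) - k_po @ np.linalg.solve(k_oo, k_po.T)
    return mean, cov


def bq_estimate(mean, cov):
    """Bayesian quadrature under a uniform measure on the pool:
    the integral of the GP posterior is Gaussian with this mean and variance."""
    n = len(mean)
    return float(mean.mean()), max(float(cov.sum()), 0.0) / n**2


def failure_probability(mean, cov, threshold):
    """Posterior probability that the score exceeds the threshold at each pool point,
    i.e. that the point lies in the superlevel set {x : f(x) > threshold}."""
    std = np.sqrt(np.clip(np.diag(cov), 1e-12, None))
    z = (threshold - mean) / std
    return np.array([0.5 * math.erfc(v / math.sqrt(2.0)) for v in z])


def evaluate(score_fn, pool, budget, prior_mean, seed=0):
    """Query the expensive score function `budget` times, each time at the
    not-yet-queried pool point with the largest posterior variance
    (a simple stand-in for an uncertainty-aware acquisition rule)."""
    rng = np.random.default_rng(seed)
    idx = [int(rng.integers(len(pool)))]
    y = [score_fn(pool[idx[0]])]
    for _ in range(budget - 1):
        _, cov = gp_posterior(pool, idx, np.array(y), prior_mean)
        var = np.diag(cov).copy()
        var[idx] = -np.inf  # never re-query an evaluated input
        nxt = int(np.argmax(var))
        idx.append(nxt)
        y.append(score_fn(pool[nxt]))
    return gp_posterior(pool, idx, np.array(y), prior_mean)


def toy_score(x):
    """Hypothetical error-severity score in [0, 1]; stands in for a model plus rater."""
    return float(x[0] ** 2)


pool = np.linspace(0.0, 1.0, 200)[:, None]
mean, cov = evaluate(toy_score, pool, budget=15, prior_mean=0.5)
estimate, variance = bq_estimate(mean, cov)
p_fail = failure_probability(mean, cov, threshold=0.8)
```

Here `estimate` approximates the average score over all 200 pool inputs from only 15 evaluations, `variance` quantifies the remaining uncertainty in that average, and `p_fail` ranks inputs by how likely they are to be failures. A failure-discovery strategy could query the highest-ranked inputs next instead of the most uncertain ones.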
