ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation
April 25, 2026
Authors: Yizheng Huang, Wenjun Zeng, Aditi Kumaresan, Zi Wang
cs.AI
Abstract
Evaluating generative AI models is increasingly resource-intensive due to slow inference, expensive raters, and a rapidly growing landscape of models and benchmarks. We propose ProEval, a proactive evaluation framework that leverages transfer learning to efficiently estimate performance and identify failure cases. ProEval employs pre-trained Gaussian Processes (GPs) as surrogates for the performance score function, mapping model inputs to metrics such as the severity of errors or safety violations. By framing performance estimation as Bayesian quadrature (BQ) and failure discovery as superlevel set sampling, we develop uncertainty-aware decision strategies that actively select or synthesize highly informative inputs for testing. Theoretically, we prove that our pre-trained GP-based BQ estimator is unbiased and bounded. Empirically, extensive experiments on reasoning, safety alignment, and classification benchmarks demonstrate that ProEval is significantly more efficient than competitive baselines. It requires 8-65x fewer samples to achieve estimates within 1% of the ground truth, while simultaneously revealing more diverse failure cases under a stricter evaluation budget.
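To make the two estimation tasks concrete, below is a minimal illustrative sketch (not the authors' code) of the general idea: a Gaussian Process surrogate fit to a small set of evaluated test inputs, used for (a) a Bayesian-quadrature-style estimate of mean performance over an input pool and (b) an uncertainty-aware acquisition score for superlevel-set (failure) discovery. The RBF kernel, its hyperparameters, the toy score function, the UCB-style acquisition, and the severity threshold are all placeholder assumptions, not details taken from the paper.

```python
# Sketch: GP surrogate for performance estimation and failure discovery.
# All kernel/threshold choices are illustrative assumptions.
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between row-vector inputs A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X_obs, y_obs, X_query, noise=1e-3):
    """Posterior mean and std of a zero-mean GP at X_query given noisy scores y_obs."""
    K = rbf_kernel(X_obs, X_obs) + noise * np.eye(len(X_obs))
    Ks = rbf_kernel(X_obs, X_query)
    Kss = rbf_kernel(X_query, X_query)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs))
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.clip(np.diag(Kss) - np.sum(v**2, axis=0), 1e-12, None)
    return mu, np.sqrt(var)

rng = np.random.default_rng(0)
pool = rng.uniform(-3, 3, size=(500, 2))                    # embedded test inputs (toy stand-in)
true_score = lambda X: np.tanh(X[:, 0]) * np.cos(X[:, 1])   # hidden severity/score function (toy)

# Evaluate only a small labelled subset -- the expensive step in practice.
idx = rng.choice(len(pool), size=25, replace=False)
X_obs, y_obs = pool[idx], true_score(pool[idx])

mu, sd = gp_posterior(X_obs, y_obs, pool)

# (a) BQ-style performance estimate: the GP posterior mean integrated over the
# empirical input distribution (here, a plain average over the pool).
perf_estimate = mu.mean()

# (b) Superlevel-set acquisition: prefer inputs likely to exceed a severity
# threshold, trading off posterior mean against uncertainty (UCB-style).
threshold, beta = 0.8, 2.0
acquisition = (mu + beta * sd) - threshold
next_query = pool[np.argmax(acquisition)]

print(f"estimated mean score: {perf_estimate:.3f} "
      f"(pool ground truth: {true_score(pool).mean():.3f})")
print("next input to evaluate for failure discovery:", next_query)
```

In this sketch the "pre-trained" aspect of ProEval would correspond to starting from a GP whose kernel and hyperparameters were fit on related evaluation data rather than the fixed values assumed here, and the pool average stands in for the quadrature weights of a full Bayesian quadrature treatment.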