ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation
April 25, 2026
Authors: Yizheng Huang, Wenjun Zeng, Aditi Kumaresan, Zi Wang
cs.AI
Abstract
Evaluating generative AI models is increasingly resource-intensive due to slow inference, expensive raters, and a rapidly growing landscape of models and benchmarks. We propose ProEval, a proactive evaluation framework that leverages transfer learning to efficiently estimate performance and identify failure cases. ProEval employs pre-trained Gaussian Processes (GPs) as surrogates for the performance score function, mapping model inputs to metrics such as the severity of errors or safety violations. By framing performance estimation as Bayesian quadrature (BQ) and failure discovery as superlevel set sampling, we develop uncertainty-aware decision strategies that actively select or synthesize highly informative inputs for testing. Theoretically, we prove that our pre-trained GP-based BQ estimator is unbiased and bounded. Empirically, extensive experiments on reasoning, safety alignment, and classification benchmarks demonstrate that ProEval is significantly more efficient than competitive baselines. It requires 8-65x fewer samples to achieve estimates within 1% of the ground truth, while simultaneously revealing more diverse failure cases under a stricter evaluation budget.
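The active loop described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's implementation: it uses a plain RBF-kernel GP (the paper's GPs are pre-trained via transfer learning), a 1-D candidate pool standing in for model inputs, and a synthetic `score` function standing in for an expensive rater. At each step it labels the pool point with the highest posterior variance, then forms a BQ-style performance estimate by averaging the posterior mean over a uniform input measure.

```python
import numpy as np

def rbf_kernel(A, B, ls=0.2):
    # Squared-exponential kernel between the rows of A and B.
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / ls**2)

def gp_posterior(X_tr, y_tr, X_q, noise=1e-4):
    # Standard GP regression posterior (mean and pointwise variance) at X_q.
    K = rbf_kernel(X_tr, X_tr) + noise * np.eye(len(X_tr))
    Ks = rbf_kernel(X_tr, X_q)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_tr))
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.clip(np.diag(rbf_kernel(X_q, X_q)) - np.sum(v**2, 0), 1e-12, None)
    return mu, var

rng = np.random.default_rng(0)
pool = np.linspace(0.0, 1.0, 200)[:, None]       # candidate test inputs (toy, 1-D)
score = lambda x: 0.5 * np.sin(6 * x[:, 0]) + 0.5  # stand-in for the costly rater

# Seed with a few labeled inputs, then actively label max-variance ones.
idx = [int(i) for i in rng.choice(len(pool), 3, replace=False)]
for _ in range(12):
    _, var = gp_posterior(pool[idx], score(pool[idx]), pool)
    var[idx] = -np.inf                            # never re-select a labeled input
    idx.append(int(np.argmax(var)))

# BQ-style estimate: average the posterior mean against a uniform measure.
mu, _ = gp_posterior(pool[idx], score(pool[idx]), pool)
bq_estimate = float(mu.mean())
true_mean = float(score(pool).mean())
```

For failure discovery, the acquisition rule would instead target the superlevel set, e.g. rank pool points by `mu + beta * sqrt(var)` against a severity threshold; the estimation loop itself is unchanged.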