ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation

Abstract

Evaluating generative AI models is increasingly resource-intensive because of slow inference, expensive raters, and a rapidly growing landscape of models and benchmarks. We propose ProEval, a proactive evaluation framework that uses transfer learning to estimate performance efficiently and to identify failure cases. ProEval employs pre-trained Gaussian Processes (GPs) as surrogates for the performance score function, which maps model inputs to metrics such as the severity of errors or safety violations. By framing performance estimation as Bayesian quadrature (BQ) and failure discovery as superlevel set sampling, we develop uncertainty-aware decision strategies that actively select or synthesize highly informative inputs for testing. Theoretically, we prove that our pre-trained GP-based BQ estimator is unbiased and bounded. Empirically, extensive experiments on reasoning, safety alignment, and classification benchmarks show that ProEval is significantly more efficient than competitive baselines: it requires 8-65x fewer samples to reach estimates within 1% of the ground truth, while also revealing more diverse failure cases under a stricter evaluation budget.
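
To make the Bayesian quadrature framing concrete, the following is a minimal sketch and not the ProEval implementation. It assumes a finite pool of one-dimensional candidate inputs, a zero-mean GP with an RBF kernel whose hyperparameters are hand-picked (rather than pre-trained, as in the paper), and a made-up score function. All names (`rbf_kernel`, `bq_estimate`, `next_input`, `toy_score`) are hypothetical. Input selection uses plain maximum-posterior-variance sampling as a simple stand-in for the paper's uncertainty-aware strategies.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=0.2, variance=1.0):
    """Squared-exponential kernel between two 1-D arrays of inputs."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

def bq_estimate(pool, tested, scores, noise=1e-4):
    """Posterior mean and variance of the pool-average score (zero-mean GP)."""
    x_obs = pool[tested]
    K = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(tested))
    # z[i] is the average kernel value between the whole pool and tested input i.
    z = rbf_kernel(pool, x_obs).mean(axis=0)
    mean = z @ np.linalg.solve(K, scores)
    var = rbf_kernel(pool, pool).mean() - z @ np.linalg.solve(K, z)
    return float(mean), float(max(var, 0.0))

def next_input(pool, tested, noise=1e-4):
    """Index of the untested input with the largest pointwise posterior variance."""
    x_obs = pool[tested]
    K = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(tested))
    K_star = rbf_kernel(pool, x_obs)
    # Prior variance is 1.0 (the kernel's default `variance`).
    var = 1.0 - np.sum(K_star * np.linalg.solve(K, K_star.T).T, axis=1)
    var[tested] = -np.inf  # never re-test an input
    return int(np.argmax(var))

def toy_score(x):
    """Hypothetical stand-in for an expensive model-plus-rater evaluation."""
    return np.sin(3.0 * x) ** 2

pool = np.linspace(0.0, 1.0, 200)   # candidate test inputs
tested = [0]                        # start from an arbitrary input
for _ in range(11):
    tested.append(next_input(pool, tested))

scores = toy_score(pool[tested])    # only 12 of 200 inputs are evaluated
estimate, variance = bq_estimate(pool, tested, scores)
true_mean = float(toy_score(pool).mean())
print(f"BQ estimate {estimate:.3f}, pool mean {true_mean:.3f}, "
      f"{len(tested)} of {len(pool)} inputs tested")
```

In this sketch the selection rule depends only on where inputs have been tested, not on the observed scores, so it behaves like space-filling sampling. Failure discovery as superlevel set sampling would instead target inputs whose score is likely to exceed a chosen threshold, which is not shown here.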
