

Position: AI Competitions Provide the Gold Standard for Empirical Rigor in GenAI Evaluation

May 1, 2025
Authors: D. Sculley, Will Cukierski, Phil Culliton, Sohier Dane, Maggie Demkin, Ryan Holbrook, Addison Howard, Paul Mooney, Walter Reade, Megan Risdal, Nate Keating
cs.AI

Abstract

In this position paper, we observe that empirical evaluation in Generative AI is at a crisis point: traditional ML evaluation and benchmarking strategies are insufficient to meet the needs of evaluating modern GenAI models and systems. There are many reasons for this, including the fact that these models typically have nearly unbounded input and output spaces, typically lack a well-defined ground-truth target, and typically exhibit strong feedback loops and prediction dependence based on the context of previous model outputs. On top of these critical issues, we argue that the problems of *leakage* and *contamination* are in fact the most important and difficult issues to address for GenAI evaluation. Interestingly, the field of AI Competitions has developed effective measures and practices to combat leakage, for the purpose of counteracting cheating by bad actors within a competition setting. This makes AI Competitions an especially valuable (but underutilized) resource. Now is the time for the field to view AI Competitions as the gold standard for empirical rigor in GenAI evaluation, and to harness and harvest their results accordingly.

