Position: AI Competitions Provide the Gold Standard for Empirical Rigor in GenAI Evaluation
May 1, 2025
Authors: D. Sculley, Will Cukierski, Phil Culliton, Sohier Dane, Maggie Demkin, Ryan Holbrook, Addison Howard, Paul Mooney, Walter Reade, Megan Risdal, Nate Keating
cs.AI
Abstract
In this position paper, we observe that empirical evaluation in Generative AI is at a crisis point, since traditional ML evaluation and benchmarking strategies are insufficient to meet the needs of evaluating modern GenAI models and systems. There are many reasons for this, including the fact that these models typically have nearly unbounded input and output spaces, typically do not have a well-defined ground-truth target, and typically exhibit strong feedback loops and prediction dependence based on the context of previous model outputs. On top of these critical issues, we argue that the problems of leakage and contamination are in fact the most important and difficult issues to address for GenAI evaluations. Interestingly, the field of AI Competitions has developed effective measures and practices to combat leakage for the purpose of counteracting cheating by bad actors within a competition setting. This makes AI Competitions an especially valuable (but underutilized) resource. Now is the time for the field to view AI Competitions as the gold standard for empirical rigor in GenAI evaluation, and to harness and harvest their results accordingly.