Position: AI Competitions Provide the Gold Standard for Empirical Rigor in GenAI Evaluation

May 1, 2025
Authors: D. Sculley, Will Cukierski, Phil Culliton, Sohier Dane, Maggie Demkin, Ryan Holbrook, Addison Howard, Paul Mooney, Walter Reade, Megan Risdal, Nate Keating
cs.AI

Abstract

In this position paper, we observe that empirical evaluation in Generative AI is at a crisis point: traditional ML evaluation and benchmarking strategies are insufficient to meet the needs of evaluating modern GenAI models and systems. There are many reasons for this, including the fact that these models typically have nearly unbounded input and output spaces, typically lack a well-defined ground-truth target, and typically exhibit strong feedback loops and prediction dependence based on the context of previous model outputs. On top of these critical issues, we argue that the problems of leakage and contamination are in fact the most important and difficult issues to address for GenAI evaluation. Interestingly, the field of AI Competitions has developed effective measures and practices to combat leakage in order to counteract cheating by bad actors within a competition setting. This makes AI Competitions an especially valuable (but underutilized) resource. Now is the time for the field to view AI Competitions as the gold standard for empirical rigor in GenAI evaluation, and to harness and harvest their results accordingly.
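
Since the abstract centers on leakage and contamination, a concrete illustration may help readers unfamiliar with the terms. The short Python sketch below shows one common screening heuristic, word-level n-gram overlap between benchmark items and training documents; it is a minimal sketch, not taken from the paper, and the function names, n-gram length, and threshold are illustrative assumptions.

```python
# Minimal sketch (not from the paper): screen benchmark items for possible
# contamination by measuring word-level n-gram overlap against training docs.
# The n-gram length and overlap threshold below are illustrative choices.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word-level n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(benchmark_item: str, training_docs: list, n: int = 8,
                    threshold: float = 0.5) -> bool:
    """Flag a benchmark item whose n-grams substantially overlap any training doc."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return False
    for doc in training_docs:
        overlap = len(item_grams & ngrams(doc, n)) / len(item_grams)
        if overlap >= threshold:
            return True
    return False

if __name__ == "__main__":
    # Toy example: the benchmark item is a near-verbatim prefix of a training doc,
    # so it is flagged as contaminated.
    train = ["the quick brown fox jumps over the lazy dog near the river bank"]
    item = "the quick brown fox jumps over the lazy dog near the river"
    print(is_contaminated(item, train))  # True
```

Heuristics like this catch only verbatim reuse; the paper's broader point is that robust anti-leakage practice, as developed in AI Competitions, goes well beyond such surface checks.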
