DISCO: Diversifying Sample Condensation for Efficient Model Evaluation

October 9, 2025
Authors: Alexander Rubinstein, Benjamin Raible, Martin Gubri, Seong Joon Oh
cs.AI

Abstract

Evaluating modern machine learning models has become prohibitively expensive. Benchmarks such as LMMs-Eval and HELM demand thousands of GPU hours per model. Costly evaluation reduces inclusivity, slows the cycle of innovation, and worsens environmental impact. The typical approach follows two steps. First, select an anchor subset of data. Second, train a mapping from the accuracy on this subset to the final test result. The drawback is that anchor selection depends on clustering, which can be complex and sensitive to design choices. We argue that promoting diversity among samples is not essential; what matters is to select samples that maximise diversity in model responses. Our method, Diversifying Sample Condensation (DISCO), selects the top-k samples with the greatest model disagreements. This uses greedy, sample-wise statistics rather than global clustering. The approach is conceptually simpler. From a theoretical view, inter-model disagreement provides an information-theoretically optimal rule for such greedy selection. DISCO shows empirical gains over prior methods, achieving state-of-the-art results in performance prediction across MMLU, Hellaswag, Winogrande, and ARC. Code is available here: https://github.com/arubique/disco-public.
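To make the selection rule concrete, the sketch below scores each sample by how much a pool of already-evaluated models disagrees on it, keeps the top-k as anchors, and fits a simple regressor from anchor-subset accuracy to full-benchmark accuracy. This is a minimal illustration under assumed inputs (a label matrix of model predictions); the function names, the majority-vote disagreement statistic, and the linear predictor are illustrative choices, not the released DISCO code.

```python
# Minimal sketch of disagreement-based anchor selection, assuming
# predictions are available as an (n_models, n_samples) label matrix.
# Names, shapes, and the linear predictor are illustrative, not the
# authors' implementation.
import numpy as np
from sklearn.linear_model import LinearRegression


def select_anchors_by_disagreement(preds: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k samples with the highest model disagreement.

    preds : (n_models, n_samples) array of predicted class labels from a
        pool of previously evaluated models.
    Disagreement is scored per sample as the fraction of models deviating
    from the majority prediction (one simple statistic; the paper motivates
    disagreement information-theoretically).
    """
    n_models, n_samples = preds.shape
    scores = np.empty(n_samples)
    for j in range(n_samples):
        _, counts = np.unique(preds[:, j], return_counts=True)
        scores[j] = 1.0 - counts.max() / n_models
    # Greedy, sample-wise selection: simply keep the top-k scores.
    return np.argsort(scores)[-k:]


def fit_performance_predictor(anchor_acc, full_acc) -> LinearRegression:
    """Map accuracy on the anchor subset to full-benchmark accuracy,
    fitted on the same pool of previously evaluated models."""
    X = np.asarray(anchor_acc).reshape(-1, 1)
    return LinearRegression().fit(X, np.asarray(full_acc))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy pool: 20 models, 1000 four-way multiple-choice samples.
    preds = rng.integers(0, 4, size=(20, 1000))
    labels = rng.integers(0, 4, size=1000)
    anchors = select_anchors_by_disagreement(preds, k=50)

    anchor_acc = (preds[:, anchors] == labels[anchors]).mean(axis=1)
    full_acc = (preds == labels).mean(axis=1)
    predictor = fit_performance_predictor(anchor_acc, full_acc)
    # A new model now only needs evaluating on the 50 anchor samples.
    print(predictor.predict(anchor_acc[:1].reshape(-1, 1)))
```

The regressor could be swapped for any mapping; the point the abstract makes is that selection needs only per-sample statistics, with no global clustering step.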