DISCO: Diversifying Sample Condensation for Efficient Model Evaluation
October 9, 2025
Authors: Alexander Rubinstein, Benjamin Raible, Martin Gubri, Seong Joon Oh
cs.AI
Abstract
Evaluating modern machine learning models has become prohibitively expensive.
Benchmarks such as LMMs-Eval and HELM demand thousands of GPU hours per model.
Costly evaluation reduces inclusivity, slows the cycle of innovation, and
worsens environmental impact. The typical approach to cutting this cost follows
two steps: first, select an anchor subset of the data; second, train a mapping
from accuracy on this subset to the full test result. The drawback is that anchor selection
depends on clustering, which can be complex and sensitive to design choices. We
argue that promoting diversity among samples is not essential; what matters is
selecting samples that maximise diversity in model responses. Our
method, Diversifying Sample Condensation (DISCO), selects the top-k
samples on which models disagree most. It uses greedy, sample-wise
statistics rather than global clustering, making it conceptually simpler.
From a theoretical view, inter-model disagreement provides an
information-theoretically optimal rule for such greedy selection.
DISCO shows empirical gains over prior methods, achieving
state-of-the-art results in performance prediction across MMLU, Hellaswag,
Winogrande, and ARC. Code is available at
https://github.com/arubique/disco-public.
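
To make the two-step pipeline and the disagreement criterion concrete, here is a minimal sketch in Python. It is illustrative rather than the authors' released implementation (see the repository above for that): it assumes hard-label predictions, uses the entropy of the label distribution across models as the disagreement statistic, and fits a simple linear map from anchor accuracy to full-test accuracy; the function names and toy data are invented for this example.

import numpy as np

rng = np.random.default_rng(0)

def disagreement_scores(preds):
    """Per-sample disagreement: entropy of the label distribution
    across models. preds: (n_models, n_samples) int label array."""
    n_models, n_samples = preds.shape
    scores = np.empty(n_samples)
    for j in range(n_samples):
        _, counts = np.unique(preds[:, j], return_counts=True)
        p = counts / n_models
        scores[j] = -(p * np.log(p)).sum()
    return scores

def select_anchors(preds, k):
    """Greedy top-k samples by disagreement; no clustering involved."""
    return np.argsort(disagreement_scores(preds))[::-1][:k]

# Toy setup: 20 source models of varying quality, 1000 samples, 4 classes.
n_models, n_samples, n_classes, k = 20, 1000, 4, 50
labels = rng.integers(n_classes, size=n_samples)
quality = rng.uniform(0.4, 0.9, size=n_models)
correct = rng.random((n_models, n_samples)) < quality[:, None]
noise = rng.integers(n_classes, size=(n_models, n_samples))
preds = np.where(correct, labels[None, :], noise)

# Step 1: pick the k samples on which the source models disagree most.
anchors = select_anchors(preds, k)

# Step 2: fit a mapping from anchor accuracy to full-test accuracy.
anchor_acc = (preds[:, anchors] == labels[anchors]).mean(axis=1)
full_acc = (preds == labels[None, :]).mean(axis=1)
slope, intercept = np.polyfit(anchor_acc, full_acc, deg=1)

# Evaluate a new model on the k anchors only and predict its full score.
new_preds = np.where(rng.random(n_samples) < 0.7,
                     labels, rng.integers(n_classes, size=n_samples))
estimate = slope * (new_preds[anchors] == labels[anchors]).mean() + intercept
print(f"estimated accuracy: {estimate:.3f}, "
      f"actual: {(new_preds == labels).mean():.3f}")

Any other spread statistic, such as the pairwise disagreement rate between models, could replace the entropy in disagreement_scores without changing the rest of the pipeline.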