Learning from the Best, Differently: A Diversity-Driven Rethinking on Data Selection
October 21, 2025
Authors: Hongyi He, Xiao Liu, Zhenghao Lin, Mingni Tang, Yi Cheng, Jintao Wang, Wenjie Li, Peng Cheng, Yeyun Gong
cs.AI
Abstract
High-quality pre-training data is crucial for large language models, where quality captures factual reliability and semantic value, and diversity ensures broad coverage and distributional heterogeneity. Existing approaches typically rely on single- or multi-dimensional score-based selection. However, directly selecting top-scored data often degrades performance, and sampling from a broader score range is required to recover results. This non-monotonicity between dataset scores and downstream benchmark results reveals a fundamental bias: score-based methods collapse correlated dimensions, causing top-scored data to appear high-quality while systematically overlooking diversity. We argue that ensuring diversity requires decomposing correlated metrics into orthogonal feature dimensions, from which top-scored data can then be selected directly. We therefore propose the Orthogonal Diversity-Aware Selection (ODiS) algorithm, which preserves both quality and diversity during data selection. First, ODiS evaluates data along multiple dimensions, covering language quality, knowledge quality, and comprehension difficulty. The multi-dimensional scores are then decorrelated via Principal Component Analysis (PCA), yielding orthogonal evaluation dimensions. For each dimension, a RoBERTa-based scorer is trained to regress documents onto their PCA-projected scores, enabling scalable inference on large corpora. Finally, ODiS constructs the training dataset by selecting top-scored data within each orthogonal dimension, thereby ensuring both quality and diversity. Empirical results show that ODiS-selected data exhibit less than 2% inter-dimension overlap, confirming the orthogonality between dimensions. More importantly, models trained on ODiS-selected data significantly outperform baselines on downstream benchmarks, highlighting the necessity of orthogonal, diversity-aware data selection for LLMs.
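
To make the selection step concrete, below is a minimal sketch of the core ODiS idea, decorrelating multi-dimensional scores with PCA and taking the top-k documents along each orthogonal axis. This is not the authors' released code: the function name odis_select, the standardization step, the union rule for combining per-dimension picks, and all parameter values are illustrative assumptions; the paper's actual RoBERTa scorer training and selection thresholds are not reproduced here.

```python
# Hypothetical sketch of ODiS-style selection (illustrative, not the paper's code).
# Assumes each document already has K raw quality scores, e.g. language quality,
# knowledge quality, and comprehension difficulty.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def odis_select(raw_scores: np.ndarray, top_k: int) -> np.ndarray:
    """Return document indices chosen as top-k along each PCA-decorrelated axis.

    raw_scores: (n_docs, n_metrics) matrix of correlated quality scores.
    """
    # Standardize so no single metric dominates the principal components.
    z = StandardScaler().fit_transform(raw_scores)

    # Decorrelate the metrics; each component is an orthogonal evaluation axis.
    projected = PCA(n_components=z.shape[1]).fit_transform(z)

    # Take the top-k documents independently along each orthogonal axis, then
    # union the index sets, preserving both quality and diversity.
    selected: set[int] = set()
    for dim in range(projected.shape[1]):
        top = np.argsort(projected[:, dim])[::-1][:top_k]
        selected.update(top.tolist())
    return np.array(sorted(selected))

# Toy usage: 10k documents scored on 3 deliberately correlated metrics.
rng = np.random.default_rng(0)
base = rng.normal(size=(10_000, 1))
scores = np.hstack([base + 0.3 * rng.normal(size=(10_000, 1)) for _ in range(3)])
idx = odis_select(scores, top_k=500)
print(f"selected {len(idx)} documents across orthogonal dimensions")
```

Because the axes are orthogonal after PCA, the per-dimension top-k sets overlap little (the abstract reports under 2% overlap for ODiS), which is what allows selecting strictly top-scored data without collapsing into a single redundant slice of the corpus.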