Learning from the Best, Differently: A Diversity-Driven Rethinking on Data Selection
October 21, 2025
Authors: Hongyi He, Xiao Liu, Zhenghao Lin, Mingni Tang, Yi Cheng, Jintao Wang, Wenjie Li, Peng Cheng, Yeyun Gong
cs.AI
Abstract
High-quality pre-training data is crucial for large language models, where
quality captures factual reliability and semantic value, and diversity ensures
broad coverage and distributional heterogeneity. Existing approaches typically
rely on single- or multi-dimensional score-based selection. However, directly
selecting top-scored data often degrades performance, and sampling from a
broader range is required to recover results. This non-monotonicity between
dataset scores and downstream benchmark results reveals a fundamental
bias: score-based methods collapse correlated dimensions, causing top-scored
data to appear high-quality while systematically overlooking diversity. We
argue that ensuring diversity requires decomposing correlated metrics into
orthogonal feature dimensions, from which the top-scored data can be directly
selected. We therefore propose the Orthogonal Diversity-Aware Selection
(ODiS) algorithm, which preserves both quality and diversity during data
selection. First, ODiS evaluates data from multiple dimensions, covering
language quality, knowledge quality, and comprehension difficulty. The
multi-dimensional scores are then decorrelated via Principal Component Analysis
(PCA), yielding orthogonal evaluation dimensions. For each dimension, a
RoBERTa-based scorer is trained to regress the data onto PCA-projected scores,
enabling scalable inference on large corpora. Finally, ODiS constructs the
training dataset by selecting top-scored data within each orthogonal dimension,
thereby ensuring both quality and diversity. Empirical results show that
ODiS-selected data exhibit less than 2% inter-dimension overlap, confirming
orthogonality between dimensions. More importantly, models trained with
ODiS-selected data significantly outperform other baselines on downstream
benchmarks, highlighting the necessity of orthogonal, diversity-aware data
selection for LLMs.
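
To make the selection step concrete, below is a minimal sketch of the core idea the abstract describes: decorrelating multi-dimensional quality scores via PCA and then taking the top-scored documents independently along each orthogonal dimension. It assumes per-document raw metric scores are already computed; the function name `odis_select`, the parameter `k_per_dim`, and the toy data are illustrative choices, not the authors' released implementation.

```python
import numpy as np
from sklearn.decomposition import PCA

def odis_select(scores: np.ndarray, k_per_dim: int) -> np.ndarray:
    """Select document indices by taking the top-k along each
    PCA-decorrelated (orthogonal) score dimension.

    scores: (n_docs, n_metrics) matrix of raw scores, e.g.
            language quality, knowledge quality, difficulty.
    """
    # Standardize each metric so no single scale dominates the components.
    z = (scores - scores.mean(axis=0)) / (scores.std(axis=0) + 1e-8)

    # Decorrelate the correlated metrics into orthogonal dimensions.
    pca = PCA(n_components=scores.shape[1])
    projected = pca.fit_transform(z)  # (n_docs, n_dims)

    # Top-k within each orthogonal dimension; the union of the
    # per-dimension picks is what preserves diversity.
    selected: set[int] = set()
    for d in range(projected.shape[1]):
        top = np.argsort(projected[:, d])[::-1][:k_per_dim]
        selected.update(top.tolist())
    return np.array(sorted(selected))

# Toy usage: 10k documents scored on three raw metrics.
rng = np.random.default_rng(0)
raw = rng.normal(size=(10_000, 3))
idx = odis_select(raw, k_per_dim=500)
print(len(idx), "documents selected")
```

In the full pipeline described above, the PCA-projected scores would additionally serve as regression targets for a RoBERTa-based scorer, so that selection can be run at corpus scale without re-scoring every document with the original multi-dimensional evaluators.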