QuaDMix: Quality-Diversity Balanced Data Selection for Efficient LLM Pretraining
April 23, 2025
Authors: Fengze Liu, Weidong Zhou, Binbin Liu, Zhimiao Yu, Yifan Zhang, Haobin Lin, Yifeng Yu, Xiaohuan Zhou, Taifeng Wang, Yong Cao
cs.AI
Abstract
Quality and diversity are two critical metrics for the training data of large
language models (LLMs), positively impacting performance. Existing studies
often optimize these metrics separately, typically by first applying quality
filtering and then adjusting data proportions. However, these approaches
overlook the inherent trade-off between quality and diversity, necessitating
their joint consideration. Given a fixed training quota, it is essential to
evaluate both the quality of each data point and its complementary effect on
the overall dataset. In this paper, we introduce a unified data selection
framework called QuaDMix, which automatically optimizes the data distribution
for LLM pretraining while balancing both quality and diversity. Specifically,
we first propose multiple criteria to measure data quality and employ domain
classification to distinguish data points, thereby measuring overall diversity.
QuaDMix then employs a unified parameterized data sampling function that
determines the sampling probability of each data point based on these quality
and diversity related labels. To accelerate the search for the optimal
parameters involved in the QuaDMix framework, we conduct simulated experiments
on smaller models and use LightGBM for parameter search, inspired by the
RegMix method. Our experiments across diverse models and datasets demonstrate
that QuaDMix achieves an average performance improvement of 7.2% across
multiple benchmarks. These results outperform strategies that optimize
quality and diversity independently, highlighting the necessity and
feasibility of balancing data quality and diversity.
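The abstract describes a unified parameterized sampling function that maps each document's quality labels and domain to a sampling probability. The paper does not give the exact functional form here, so the sketch below is only an illustrative assumption: multiple quality scores are merged by learned weights, passed through a sigmoid gate, and scaled by the domain's target proportion. All names (`quality_weights`, `sharpness`, `threshold`) are hypothetical; in QuaDMix such parameters would be found via proxy-model experiments, not set by hand.

```python
import math

def sampling_probability(quality_scores, domain_proportion, params):
    """Hypothetical QuaDMix-style sampler: probability of keeping one
    document, given its quality labels and its domain's target share."""
    # Merge multiple quality criteria into a single scalar score.
    merged = sum(w * q for w, q in zip(params["quality_weights"], quality_scores))
    # Sigmoid gate: documents above the quality threshold are kept more
    # often; `sharpness` controls how hard the cutoff is.
    gate = 1.0 / (1.0 + math.exp(-params["sharpness"] * (merged - params["threshold"])))
    # Scale by how much of this domain the target mixture wants.
    return domain_proportion * gate

# Illustrative parameters (in practice these would be searched).
params = {"quality_weights": [0.5, 0.3, 0.2], "sharpness": 5.0, "threshold": 0.6}
p_high = sampling_probability([0.9, 0.8, 0.9], 0.4, params)  # high-quality doc
p_low = sampling_probability([0.2, 0.1, 0.3], 0.4, params)   # low-quality doc
```

Under this sketch, the quality-diversity trade-off shows up directly: a high-quality document in an over-represented domain can end up with a lower sampling probability than a mediocre document in a rare domain, which is exactly the joint effect the framework optimizes.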