QuaDMix: Quality-Diversity Balanced Data Selection for Efficient LLM Pretraining

Abstract

Quality and diversity are two critical metrics for the training data of large language models (LLMs), and both improve model performance. Existing studies often optimize these metrics separately, typically by first applying quality filtering and then adjusting data proportions. These approaches overlook the inherent trade-off between quality and diversity, which calls for considering the two jointly. Given a fixed training quota, it is essential to evaluate both the quality of each data point and its complementary effect on the overall dataset. In this paper, we introduce QuaDMix, a unified data selection framework that automatically optimizes the data distribution for LLM pretraining while balancing quality and diversity. Specifically, we first propose multiple criteria to measure data quality and use domain classification to distinguish data points, thereby measuring overall diversity. QuaDMix then applies a unified parameterized data sampling function that determines the sampling probability of each data point from these quality- and diversity-related labels. To accelerate the search for the optimal parameters of the QuaDMix framework, we run simulated experiments on smaller models and use LightGBM for the parameter search, inspired by the RegMix method. Our experiments across diverse models and datasets show that QuaDMix achieves an average performance improvement of 7.2% across multiple benchmarks. It outperforms strategies that optimize quality and diversity independently, highlighting both the necessity and the feasibility of balancing data quality and diversity.
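To make the idea of a parameterized sampling function concrete, the following is a minimal sketch and not the function used in the paper. It assumes each document carries several quality scores in [0, 1] and one domain label. The logistic form and the parameter names (`criterion_weights`, `threshold`, `sharpness`, `scale`) are hypothetical choices made only for illustration.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class DomainParams:
    """Hypothetical per-domain parameters (illustrative names, not from the paper)."""
    criterion_weights: tuple  # weights used to merge the quality criteria
    threshold: float          # merged quality at which the logistic term equals 0.5
    sharpness: float          # how steeply probability rises with quality
    scale: float              # cap on the domain's probability (diversity knob)


def merged_quality(scores, weights):
    """Weighted average of several quality scores for one document."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


def sampling_probability(scores, domain, params):
    """Map (quality scores, domain label) to a sampling probability in [0, 1]."""
    p = params[domain]
    q = merged_quality(scores, p.criterion_weights)
    # Assumed functional form: a logistic curve in merged quality.
    logistic = 1.0 / (1.0 + math.exp(-p.sharpness * (q - p.threshold)))
    return min(1.0, max(0.0, p.scale * logistic))


def select(documents, params, seed=0):
    """Keep each document independently with its sampling probability."""
    rng = random.Random(seed)
    return [
        doc for doc in documents
        if rng.random() < sampling_probability(doc["scores"], doc["domain"], params)
    ]


# Toy data: two domains, two quality criteria per document (all values made up).
params = {
    "code": DomainParams((0.5, 0.5), threshold=0.6, sharpness=10.0, scale=1.0),
    "news": DomainParams((0.8, 0.2), threshold=0.4, sharpness=10.0, scale=0.5),
}
documents = [
    {"id": "a", "domain": "code", "scores": (0.9, 0.8)},
    {"id": "b", "domain": "code", "scores": (0.2, 0.3)},
    {"id": "c", "domain": "news", "scores": (0.7, 0.5)},
]

for doc in documents:
    prob = sampling_probability(doc["scores"], doc["domain"], params)
    print(doc["id"], doc["domain"], round(prob, 3))
```

In this sketch the parameters are fixed by hand. In the framework described above they would instead be searched, with a regressor such as LightGBM fit on small-model runs to predict which parameter settings perform well.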
