SampleMix: A Sample-wise Pre-training Data Mixing Strategy by Coordinating Data Quality and Diversity

Abstract

Existing pre-training data mixing methods for large language models (LLMs) typically follow a domain-wise methodology: a top-down process that first determines domain weights and then samples data uniformly within each domain. However, these approaches neglect significant inter-domain overlaps and commonalities, and so fail to control the global diversity of the constructed training dataset. Further, uniform sampling within domains ignores fine-grained sample-specific features, potentially leading to a suboptimal data distribution. To address these shortcomings, we propose a novel sample-wise data mixture approach based on a bottom-up paradigm. This method performs global cross-domain sampling by systematically evaluating the quality and diversity of each sample, thereby dynamically determining the optimal domain distribution. Comprehensive experiments across multiple downstream tasks and perplexity assessments demonstrate that SampleMix surpasses existing domain-based methods. Moreover, SampleMix reaches the baselines' performance with 1.4x to 2.1x fewer training steps, highlighting its substantial potential to optimize pre-training data.
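The bottom-up idea can be illustrated with a minimal sketch. It is not the paper's implementation: the per-sample `quality` and `diversity` scores, the linear mixing coefficient `alpha`, and the function name `sample_wise_mix` are all assumptions made for illustration, and sampling is done with replacement for simplicity. The point it shows is that the domain distribution is never specified up front; it falls out of the per-sample weights.

```python
import random
from collections import Counter


def sample_wise_mix(samples, budget, alpha=0.5, seed=0):
    """Draw `budget` samples globally, across all domains.

    samples: list of dicts with keys 'domain', 'quality', 'diversity'
             (scores assumed to lie in [0, 1]; how they are computed
             is outside the scope of this sketch).
    alpha:   hypothetical trade-off between diversity and quality.
    Returns the drawn samples and the domain shares they induce.
    """
    rng = random.Random(seed)
    # Assumed weighting: a simple linear blend of the two scores.
    weights = [
        alpha * s["diversity"] + (1 - alpha) * s["quality"] for s in samples
    ]
    # Global, cross-domain sampling (with replacement) by per-sample weight.
    chosen = rng.choices(samples, weights=weights, k=budget)
    # The domain distribution is an outcome, not an input.
    counts = Counter(s["domain"] for s in chosen)
    shares = {domain: n / budget for domain, n in counts.items()}
    return chosen, shares
```

Calling `sample_wise_mix(data, budget=1000)` on a scored corpus returns the selected samples together with the resulting share of each domain, which can then be compared against hand-set domain weights.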
