
SampleMix: A Sample-wise Pre-training Data Mixing Strategy by Coordinating Data Quality and Diversity

March 3, 2025
作者: Xiangyu Xi, Deyang Kong, Jian Yang, Jiawei Yang, Zhengyu Chen, Wei Wang, Jingang Wang, Xunliang Cai, Shikun Zhang, Wei Ye
cs.AI

Abstract

Existing pre-training data mixing methods for large language models (LLMs) typically follow a domain-wise methodology: a top-down process that first determines domain weights and then performs uniform data sampling within each domain. However, these approaches neglect significant inter-domain overlaps and commonalities, failing to control the global diversity of the constructed training dataset. Furthermore, uniform sampling within domains ignores fine-grained, sample-specific features, potentially leading to a suboptimal data distribution. To address these shortcomings, we propose SampleMix, a novel sample-wise data mixing approach based on a bottom-up paradigm. The method performs global cross-domain sampling by systematically evaluating the quality and diversity of each sample, thereby dynamically determining the optimal domain distribution. Comprehensive experiments across multiple downstream tasks and perplexity assessments demonstrate that SampleMix surpasses existing domain-based methods. Meanwhile, SampleMix reaches the baselines' performance using 1.4x to 2.1x fewer training steps, highlighting its substantial potential for optimizing pre-training data.
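To make the bottom-up idea concrete, the following is a minimal, hypothetical sketch rather than the paper's actual implementation: it assumes each document already carries a quality score and a diversity score, combines them into a single sampling weight, and draws the training set from the global pool so that the domain mixture emerges from the selected samples instead of from predetermined domain quotas. The class, function names, and weighting formula are illustrative assumptions.

```python
import random
from collections import Counter
from dataclasses import dataclass


@dataclass
class Sample:
    text: str
    domain: str
    quality: float    # e.g., output of a quality scorer, assumed in [0, 1]
    diversity: float  # e.g., a cluster-based rarity score, assumed in [0, 1]


def sampling_weight(sample: Sample, alpha: float = 0.5) -> float:
    """Combine per-sample quality and diversity into a single sampling weight."""
    return alpha * sample.quality + (1.0 - alpha) * sample.diversity


def mix_dataset(pool: list, budget: int, seed: int = 0) -> list:
    """Draw `budget` samples from the global cross-domain pool, weighted
    per sample, rather than fixing domain proportions up front."""
    rng = random.Random(seed)
    weights = [sampling_weight(s) for s in pool]
    return rng.choices(pool, weights=weights, k=budget)


if __name__ == "__main__":
    pool = [
        Sample("doc about code", "github", quality=0.9, diversity=0.4),
        Sample("news article", "cc-news", quality=0.6, diversity=0.8),
        Sample("boilerplate page", "common-crawl", quality=0.2, diversity=0.3),
    ]
    mixture = mix_dataset(pool, budget=5)
    # The domain distribution emerges bottom-up from the selected samples.
    print(Counter(s.domain for s in mixture))
```

In this sketch the domain weights are never set explicitly; they are a by-product of how many samples from each domain win the global, weight-based draw, which is the key contrast with top-down, domain-wise mixing.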

