R&B: Domain Regrouping and Data Mixture Balancing for Efficient Foundation Model Training
May 1, 2025
Authors: Albert Ge, Tzu-Heng Huang, John Cooper, Avi Trost, Ziyi Chu, Satya Sai Srinath Namburi GNVV, Ziyang Cai, Kendall Park, Nicholas Roberts, Frederic Sala
cs.AI
Abstract
Data mixing strategies have successfully reduced the costs involved in
training language models. While promising, such methods suffer from two flaws.
First, they rely on predetermined data domains (e.g., data sources, task
types), which may fail to capture critical semantic nuances, leaving
performance on the table. Second, these methods scale with the number of
domains in a computationally prohibitive way. We address these challenges via
R&B, a framework that re-partitions training data based on semantic similarity
(Regroup) to create finer-grained domains, and efficiently optimizes the data
composition (Balance) by leveraging a Gram matrix induced by domain gradients
obtained throughout training. Unlike prior works, it removes the need for
additional compute to obtain evaluation information such as losses or
gradients. We analyze this technique under standard regularity conditions and
provide theoretical insights that justify R&B's effectiveness compared to
non-adaptive mixing approaches. Empirically, we demonstrate the effectiveness
of R&B on five diverse datasets ranging from natural language to reasoning and
multimodal tasks. With as little as 0.01% additional compute overhead, R&B
matches or exceeds the performance of state-of-the-art data mixing strategies.
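The two stages named in the abstract can be illustrated with a short sketch. This is not the authors' reference implementation: it assumes the Regroup step is k-means clustering over semantic embeddings and that the Balance step scores each domain by the row sums of the Gram matrix of per-domain gradients, normalized with a softmax; the paper's actual update rule may differ.

```python
# Hypothetical sketch of the two R&B stages (Regroup, Balance); not the authors' code.
import numpy as np
from sklearn.cluster import KMeans


def regroup(embeddings: np.ndarray, n_domains: int, seed: int = 0) -> np.ndarray:
    """Re-partition training examples into finer-grained domains by
    clustering their semantic embeddings (assumed here: k-means)."""
    return KMeans(n_clusters=n_domains, random_state=seed, n_init=10).fit_predict(embeddings)


def balance(domain_grads: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Turn per-domain gradient vectors (already produced during training,
    so no extra forward/backward passes) into new mixture weights.
    Assumed rule: score domains via the Gram matrix G = A A^T and softmax."""
    gram = domain_grads @ domain_grads.T          # G[i, j] = <g_i, g_j>
    scores = gram.sum(axis=1) / temperature       # each domain's alignment with the rest
    scores -= scores.max()                        # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()


# Toy usage: 1,000 examples with 32-dim embeddings regrouped into 8 domains,
# then mixture weights computed from stand-in per-domain gradients.
rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 32))
domains = regroup(emb, n_domains=8)
grads = rng.normal(size=(8, 64))
print(balance(grads))
```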