R&B: Domain Regrouping and Data Mixture Balancing for Efficient Foundation Model Training

May 1, 2025
作者: Albert Ge, Tzu-Heng Huang, John Cooper, Avi Trost, Ziyi Chu, Satya Sai Srinath Namburi GNVV, Ziyang Cai, Kendall Park, Nicholas Roberts, Frederic Sala
cs.AI

Abstract

Data mixing strategies have successfully reduced the costs involved in training language models. While promising, such methods suffer from two flaws. First, they rely on predetermined data domains (e.g., data sources, task types), which may fail to capture critical semantic nuances, leaving performance on the table. Second, these methods scale with the number of domains in a computationally prohibitive way. We address these challenges via R&B, a framework that re-partitions training data based on semantic similarity (Regroup) to create finer-grained domains, and efficiently optimizes the data composition (Balance) by leveraging a Gram matrix induced by domain gradients obtained throughout training. Unlike prior works, it removes the need for additional compute to obtain evaluation information such as losses or gradients. We analyze this technique under standard regularity conditions and provide theoretical insights that justify R&B's effectiveness compared to non-adaptive mixing approaches. Empirically, we demonstrate the effectiveness of R&B on five diverse datasets ranging from natural language to reasoning and multimodal tasks. With as little as 0.01% additional compute overhead, R&B matches or exceeds the performance of state-of-the-art data mixing strategies.
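The abstract describes the two stages at a high level: Regroup re-partitions examples into finer-grained domains by semantic similarity, and Balance re-weights those domains using a Gram matrix built from domain gradients already produced during training. Below is a minimal Python sketch of that pipeline, assuming pre-computed example embeddings and per-domain average gradients; the function names (`regroup`, `balance`), the k-means clustering choice, and the row-sum re-weighting rule are illustrative assumptions, not the authors' implementation, which the paper derives from a Gram-matrix optimization.

```python
# Illustrative sketch of the Regroup / Balance stages described in the abstract.
# Assumes embeddings and per-domain gradients are already available; the specific
# clustering and re-weighting rules here are simplifications, not the paper's method.
import numpy as np
from sklearn.cluster import KMeans


def regroup(embeddings: np.ndarray, n_domains: int) -> np.ndarray:
    """Regroup: re-partition training examples into finer-grained domains
    by clustering their semantic embeddings."""
    return KMeans(n_clusters=n_domains, n_init=10, random_state=0).fit_predict(embeddings)


def balance(domain_grads: np.ndarray) -> np.ndarray:
    """Balance: derive mixture weights from the Gram matrix of per-domain
    gradients collected during training (no extra forward/backward passes).

    domain_grads: (n_domains, n_params) array of averaged domain gradients.
    """
    gram = domain_grads @ domain_grads.T          # G[i, j] = <g_i, g_j>
    scores = np.clip(gram.sum(axis=1), 0.0, None)  # crude proxy: alignment with all domains
    if scores.sum() == 0.0:
        return np.full(len(scores), 1.0 / len(scores))  # fall back to uniform mixing
    return scores / scores.sum()                   # normalize onto the probability simplex


# Example usage (hypothetical inputs):
# domain_ids = regroup(example_embeddings, n_domains=32)
# mixture_weights = balance(per_domain_gradients)
```

The point the sketch preserves is the source of R&B's efficiency claim: Balance reuses gradients that training already computes, so the only added cost is forming and manipulating a small n_domains-by-n_domains Gram matrix, consistent with the reported ~0.01% compute overhead.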
