TiKMiX: Take Data Influence into Dynamic Mixture for Language Model Pre-training
August 25, 2025
Authors: Yifan Wang, Binbin Liu, Fengze Liu, Yuanfan Guo, Jiyao Deng, Xuecheng Wu, Weidong Zhou, Xiaohuan Zhou, Taifeng Wang
cs.AI
Abstract
The data mixture used in the pre-training of a language model is a
cornerstone of its final performance. However, a static mixing strategy is
suboptimal, as the model's learning preferences for various data domains shift
dynamically throughout training. Crucially, observing these evolving
preferences in a computationally efficient manner remains a significant
challenge. To address this, we propose TiKMiX, a method that dynamically
adjusts the data mixture according to the model's evolving preferences. TiKMiX
introduces Group Influence, an efficient metric for evaluating the impact of
data domains on the model. This metric enables the formulation of the data
mixing problem as a search for an optimal, influence-maximizing distribution.
We solve this via two approaches: TiKMiX-D for direct optimization, and
TiKMiX-M, which uses a regression model to predict a superior mixture. We
trained models with different numbers of parameters, on up to 1 trillion
tokens. TiKMiX-D exceeds the performance of state-of-the-art methods like
REGMIX while using just 20% of the computational resources. TiKMiX-M leads to
an average performance gain of 2% across 9 downstream benchmarks. Our
experiments reveal that a model's data preferences evolve with training
progress and scale, and we demonstrate that dynamically adjusting the data
mixture based on Group Influence, a direct measure of these preferences,
significantly improves performance by mitigating the underdigestion of data
seen with static ratios.
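
To make the idea of an "influence-maximizing distribution" concrete, here is a minimal sketch of one way a mixture could be shifted toward higher-influence domains. It is not the TiKMiX-D objective or the TiKMiX-M regression model, neither of which is specified in the abstract. It assumes a per-domain influence score is already available (the scores, the domain names, and the `reweight_mixture` helper are all hypothetical) and applies a standard exponentiated-gradient step, which maximizes the expected influence minus a KL penalty for moving away from the current mixture.

```python
import math


def reweight_mixture(current, influence, step_size=1.0):
    """One exponentiated-gradient step on the probability simplex.

    current:   dict mapping domain -> current sampling weight (sums to 1)
    influence: dict mapping domain -> hypothetical influence score
               (assumed given; how Group Influence is computed is not shown)
    step_size: larger values move further from the current mixture
    """
    # Subtracting the max score keeps exp() numerically stable and does not
    # change the result, because the weights are renormalized afterwards.
    max_score = max(influence.values())
    unnormalized = {
        domain: weight * math.exp(step_size * (influence[domain] - max_score))
        for domain, weight in current.items()
    }
    total = sum(unnormalized.values())
    return {domain: value / total for domain, value in unnormalized.items()}


# Illustrative numbers only; not taken from the paper.
current = {"web": 0.5, "code": 0.3, "math": 0.2}
influence = {"web": 0.1, "code": 0.4, "math": 0.4}
updated = reweight_mixture(current, influence, step_size=2.0)
```

In a dynamic setting, a step like this would be repeated at intervals during training with freshly measured scores, so the mixture tracks the model's changing preferences instead of staying fixed.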