Demystifying Training-Time Data Augmentation for Data-Constrained Language Model Pretraining

Abstract

As AI labs approach a data ceiling where compute capacity outpaces the rate of new high-quality text generation, language model pretraining is shifting toward a data-constrained, compute-abundant regime that demands productive multi-epoch training on fixed corpora. Standard autoregressive (AR) pretraining overfits severely in this setting, reaching its optimum early and then continuously deteriorating. We investigate training-time data augmentation as a regularizer to mitigate this overfitting and enable productive training for hundreds of epochs on the same data. We introduce three orthogonal categories of augmentation for AR pretraining: token-level noise (masking, random replacement), sequence permutations (right-to-left prediction, Fill-in-the-Middle), and target offset prediction (x_{t+i} for i > 1). Through systematic ablations, we find that individual augmentations delay overfitting and lower validation loss relative to the baseline, with random token replacement achieving the best minimum loss among individual methods. Combining augmentation categories further lowers the minimum validation loss. Our experiments demonstrate that data augmentations mitigate AR pretraining's data inefficiency and offer a promising solution to the data-constrained regime~\footnote{All code and data are available at https://github.com/michaelchen-lab/data-augmentations-for-pretraining.}
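
The three augmentation categories named in the abstract can be illustrated on a plain list of token ids. The sketch below is not taken from the released repository; the function names, the sentinel ids, the prefix-suffix-middle layout for Fill-in-the-Middle, and the choice to return noised inputs separately from clean targets are all assumptions made for illustration.

```python
import random

# Illustrative sketch only: names, sentinel ids, and layouts are assumptions,
# not the paper's released implementation.

def mask_tokens(tokens, mask_id, p, rng):
    """Token-level noise: swap each token for mask_id with probability p."""
    return [mask_id if rng.random() < p else t for t in tokens]

def random_replace(tokens, vocab_size, p, rng):
    """Token-level noise: swap each token for a uniformly random id with probability p."""
    return [rng.randrange(vocab_size) if rng.random() < p else t for t in tokens]

def right_to_left(tokens):
    """Sequence permutation: reverse the sequence so the model predicts right to left."""
    return tokens[::-1]

def fill_in_the_middle(tokens, rng, pre_id, suf_id, mid_id):
    """Sequence permutation: cut into prefix/middle/suffix and move the middle to the end.

    Assumes a prefix-suffix-middle layout with three hypothetical sentinel ids.
    """
    i, j = sorted(rng.sample(range(len(tokens) + 1), 2))
    prefix, middle, suffix = tokens[:i], tokens[i:j], tokens[j:]
    return [pre_id] + prefix + [suf_id] + suffix + [mid_id] + middle

def offset_pairs(tokens, i):
    """Target offset prediction: pair each input x_t with the target x_{t+i}."""
    if i < 1 or i >= len(tokens):
        raise ValueError("offset must satisfy 1 <= i < len(tokens)")
    return tokens[:-i], tokens[i:]

if __name__ == "__main__":
    rng = random.Random(0)
    seq = [11, 12, 13, 14, 15, 16]
    print(mask_tokens(seq, mask_id=0, p=0.3, rng=rng))
    print(random_replace(seq, vocab_size=100, p=0.3, rng=rng))
    print(right_to_left(seq))
    print(fill_in_the_middle(seq, rng, pre_id=1, suf_id=2, mid_id=3))
    print(offset_pairs(seq, 2))
```

Because each function maps a token list to a token list (or an input/target pair), transformations from different categories can be composed, which is one plausible way to read the abstract's combination of augmentation categories.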