Cross-Lingual Supervision improves Large Language Models Pre-training
May 19, 2023
Authors: Andrea Schioppa, Xavier Garcia, Orhan Firat
cs.AI
Abstract
The recent rapid progress in pre-training Large Language Models has relied on
using self-supervised language modeling objectives like next token prediction
or span corruption. On the other hand, Machine Translation Systems are mostly
trained using cross-lingual supervision that requires aligned data between
source and target languages. We demonstrate that pre-training Large Language
Models on a mixture of a self-supervised Language Modeling objective and the
supervised Machine Translation objective, therefore including cross-lingual
parallel data during pre-training, yields models with better in-context
learning abilities. As pre-training is a very resource-intensive process and a
grid search over the mixing ratio between the two objectives is
prohibitively expensive, we propose a simple yet effective strategy to learn this ratio
during pre-training.
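The abstract does not specify the ratio-learning strategy, but the idea of mixing a self-supervised LM objective with a supervised MT objective under an adaptive mixing ratio can be sketched as follows. This is a minimal, hypothetical illustration (the function names, the reward-based update rule, and the clipping bounds are all assumptions, not the paper's method): each batch is drawn from one objective with probability `p`, and `p` is nudged toward whichever objective's loss is currently improving faster.

```python
import random

def train_with_learned_mixing(steps, lm_loss_fn, mt_loss_fn, lr=0.1, p0=0.5):
    """Sketch: draw each pre-training batch from the self-supervised LM
    objective with probability p, otherwise from the supervised MT objective,
    and adapt p online instead of grid-searching it (hypothetical rule)."""
    p = p0
    losses = {"lm": [], "mt": []}
    for step in range(steps):
        use_lm = random.random() < p
        key = "lm" if use_lm else "mt"
        # Stand-ins for one optimization step on the chosen objective;
        # here the loss functions are toy callables of the step index.
        loss = lm_loss_fn(step) if use_lm else mt_loss_fn(step)
        losses[key].append(loss)
        # Hypothetical update: reward an objective whose loss is dropping
        # (reward = previous loss - current loss) by shifting p toward it.
        hist = losses[key]
        if len(hist) >= 2:
            reward = hist[-2] - hist[-1]
            p += lr * reward if use_lm else -lr * reward
            p = min(max(p, 0.05), 0.95)  # keep both objectives in the mix
    return p
```

With a toy LM loss that decays over training and a flat MT loss, `p` drifts above its initial value, i.e. the schedule allocates more batches to the objective that is still improving. A real implementation would update `p` from held-out losses on actual model parameters rather than these stand-in callables.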