Cross-Lingual Supervision improves Large Language Models Pre-training

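Abstract

Recent rapid progress in pre-training Large Language Models has relied mainly on self-supervised language modeling objectives such as next-token prediction or span corruption. Machine Translation systems, by contrast, are mostly trained with cross-lingual supervision, which requires aligned data between source and target languages. We demonstrate that pre-training Large Language Models on a mixture of a self-supervised Language Modeling objective and a supervised Machine Translation objective, and thereby including cross-lingual parallel data during pre-training, yields models with better in-context learning abilities. Because pre-training is very resource-intensive and a grid search over the mixing ratio between the two objectives is prohibitively expensive, we propose a simple yet effective strategy to learn this ratio during pre-training.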
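To make the idea of a mixing ratio concrete, the following is a minimal sketch of how training batches might be drawn from the two objectives. It assumes a fixed ratio and per-batch sampling; the names `sample_objective`, `objective_schedule`, and `mt_ratio` are hypothetical and do not come from the paper, and the strategy for learning the ratio during pre-training is not described in the abstract, so it is not sketched here.

```python
import random


def sample_objective(mt_ratio: float, rng: random.Random) -> str:
    """Pick the objective for the next batch.

    Returns "mt" (supervised machine translation on parallel data) with
    probability mt_ratio, otherwise "lm" (self-supervised language modeling).
    The fixed mt_ratio is an assumption made for illustration only.
    """
    if not 0.0 <= mt_ratio <= 1.0:
        raise ValueError("mt_ratio must be in [0, 1]")
    # rng.random() is uniform on [0, 1), so 0.0 never selects "mt"
    # and 1.0 always selects "mt".
    return "mt" if rng.random() < mt_ratio else "lm"


def objective_schedule(num_steps: int, mt_ratio: float, seed: int = 0) -> list[str]:
    """Draw the objective for each of num_steps training steps."""
    rng = random.Random(seed)
    return [sample_objective(mt_ratio, rng) for _ in range(num_steps)]
```

For example, `objective_schedule(10_000, 0.3)` gives a list in which roughly 30% of the steps use the translation objective. A grid search would repeat a full pre-training run for each candidate value of `mt_ratio`, which is the cost the abstract sets out to avoid.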

