Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling

January 29, 2024
作者: Pratyush Maini, Skyler Seto, He Bai, David Grangier, Yizhe Zhang, Navdeep Jaitly
cs.AI

Abstract

Large language models are trained on massive scrapes of the web, which are often unstructured, noisy, and poorly phrased. Current scaling laws show that learning from such data requires an abundance of both compute and data, which grows with the size of the model being trained. This is infeasible both because of the large compute costs and duration associated with pre-training, and the impending scarcity of high-quality data on the web. In this work, we propose Web Rephrase Augmented Pre-training (WRAP), which uses an off-the-shelf instruction-tuned model prompted to paraphrase documents on the web in specific styles such as "like Wikipedia" or in "question-answer format" to jointly pre-train LLMs on real and synthetic rephrases. First, we show that using WRAP on the C4 dataset, which is naturally noisy, speeds up pre-training by ~3x. At the same pre-training compute budget, it improves perplexity by more than 10% on average across different subsets of the Pile, and improves zero-shot question-answering accuracy across 13 tasks by more than 2%. Second, we investigate the impact of the rephrasing style on the performance of the model, offering insights into how the composition of the training data can impact the performance of LLMs in OOD settings. Our gains are attributed to the fact that rephrased synthetic data has higher utility than real data alone because it (i) incorporates style diversity that closely reflects downstream evaluation style, and (ii) has higher 'quality' than web-scraped data.
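
A minimal sketch of the rephrasing step described in the abstract, assuming a Hugging Face transformers text-generation pipeline and Mistral-7B-Instruct as a stand-in for the off-the-shelf instruction-tuned rephraser; the prompt wording, style names, and 1:1 real/synthetic mixing below are illustrative assumptions rather than the paper's exact recipe.

```python
from transformers import pipeline

# Any off-the-shelf instruction-tuned model can play the rephraser role;
# the checkpoint below is an assumed stand-in, not necessarily the paper's choice.
rephraser = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

# Illustrative style instructions mirroring "like Wikipedia" and "question-answer format".
STYLE_PROMPTS = {
    "wikipedia": (
        "Rewrite the following passage in high-quality, well-phrased English, "
        "in the encyclopedic style of Wikipedia:\n\n{doc}"
    ),
    "qa": (
        "Convert the following passage into a question-and-answer format:\n\n{doc}"
    ),
}

def rephrase(doc: str, style: str = "wikipedia") -> str:
    """Generate one synthetic rephrase of a web document in the given style."""
    prompt = STYLE_PROMPTS[style].format(doc=doc)
    out = rephraser(prompt, max_new_tokens=512, do_sample=False, return_full_text=False)
    return out[0]["generated_text"].strip()

def wrap_mixture(real_docs, style: str = "wikipedia"):
    """Yield real web documents alongside their synthetic rephrases (a 1:1 mix)."""
    for doc in real_docs:
        yield doc                    # real (noisy) web text, e.g. a C4 document
        yield rephrase(doc, style)   # synthetic rephrase of the same document
```

In the paper, the synthetic rephrases are combined with the original web text for joint pre-training; the choice of rephrasing style and the real-to-synthetic mixing ratio are exactly the design choices whose effect on downstream and OOD performance the work studies.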