Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling
January 29, 2024
Authors: Pratyush Maini, Skyler Seto, He Bai, David Grangier, Yizhe Zhang, Navdeep Jaitly
cs.AI
Abstract
Large language models are trained on massive scrapes of the web, which are
often unstructured, noisy, and poorly phrased. Current scaling laws show that
learning from such data requires an abundance of both compute and data, which
grows with the size of the model being trained. This is infeasible both because
of the large compute costs and duration associated with pre-training, and the
impending scarcity of high-quality data on the web. In this work, we propose
Web Rephrase Augmented Pre-training (WRAP) that uses an
off-the-shelf instruction-tuned model prompted to paraphrase documents on the
web in specific styles such as "like Wikipedia" or in "question-answer format"
to jointly pre-train LLMs on real and synthetic rephrases. First, we show that
using WRAP on the C4 dataset, which is naturally noisy, speeds up pre-training
by ~3x. At the same pre-training compute budget, it improves perplexity by
more than 10% on average across different subsets of the Pile, and improves
zero-shot question-answering accuracy across 13 tasks by more than 2%. Second, we
investigate the impact of the re-phrasing style on the performance of the
model, offering insights into how the composition of the training data can
impact the performance of LLMs in OOD settings. Our gains are attributed to the
fact that re-phrased synthetic data has higher utility than just real data
because it (i) incorporates style diversity that closely reflects downstream
evaluation style, and (ii) has higher 'quality' than web-scraped data.
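A minimal sketch of the data pipeline the abstract describes, assuming a generic generation callback in place of the instruction-tuned model; the prompt wording, style names, and mixing scheme here are illustrative placeholders, not the authors' implementation:

```python
# Sketch of WRAP-style data augmentation: rephrase each web document in
# several styles, then mix real and synthetic text for pre-training.
# The LLM call is stubbed; a real pipeline would invoke an
# instruction-tuned model with these prompts.

STYLE_PROMPTS = {
    # Hypothetical prompt templates, loosely following the styles named
    # in the abstract ("like Wikipedia", "question-answer format").
    "wikipedia": "Rephrase the following text in the style of a Wikipedia article:\n\n{doc}",
    "qa": "Rewrite the following text as a series of questions and answers:\n\n{doc}",
}

def rephrase(document: str, style: str, generate=None) -> str:
    """Build a style-specific prompt and paraphrase the document.

    `generate` is a callable wrapping the instruction-tuned model;
    when omitted, a tagged placeholder stands in for its output.
    """
    prompt = STYLE_PROMPTS[style].format(doc=document)
    if generate is None:
        return f"[{style} rephrase] {document}"  # stub output
    return generate(prompt)

def build_training_mix(real_docs, styles=("wikipedia", "qa")):
    """Combine real documents with their synthetic rephrasings,
    so the model is jointly pre-trained on both."""
    mix = list(real_docs)
    for doc in real_docs:
        for style in styles:
            mix.append(rephrase(doc, style))
    return mix

corpus = ["Noisy web text scraped from C4 ..."]
mix = build_training_mix(corpus)
print(len(mix))  # 1 real + 2 rephrased = 3
```

The key design point the abstract emphasizes is the joint mix: synthetic rephrasings augment rather than replace the real web text, which is what yields both style diversity and higher average quality.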