In-Context Pretraining: Language Modeling Beyond Document Boundaries
October 16, 2023
Authors: Weijia Shi, Sewon Min, Maria Lomeli, Chunting Zhou, Margaret Li, Victoria Lin, Noah A. Smith, Luke Zettlemoyer, Scott Yih, Mike Lewis
cs.AI
Abstract
Large language models (LMs) are currently trained to predict tokens given document prefixes, enabling them to directly perform long-form generation and prompting-style tasks which can be reduced to document completion. Existing pretraining pipelines train LMs by concatenating random sets of short documents to create input contexts, but the prior documents provide no signal for predicting the next document. We instead present In-Context Pretraining, a new approach where language models are pretrained on a sequence of related documents, thereby explicitly encouraging them to read and reason across document boundaries. We can do In-Context Pretraining by simply changing the document ordering so that each context contains related documents, and directly applying existing pretraining pipelines. However, this document sorting problem is challenging. There are billions of documents, and we would like the sort to maximize contextual similarity for every document without repeating any data. To do this, we introduce approximate algorithms for finding related documents with efficient nearest neighbor search and for constructing coherent input contexts with a graph traversal algorithm. Our experiments show In-Context Pretraining offers a simple and scalable approach to significantly enhance LMs' performance: we see notable improvements in tasks that require more complex contextual reasoning, including in-context learning (+8%), reading comprehension (+15%), faithfulness to previous contexts (+16%), long-context reasoning (+5%), and retrieval augmentation (+9%).
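
To make the document-sorting idea concrete, here is a minimal, hypothetical Python sketch of how related documents could be grouped into input contexts. It assumes document embeddings are already available, uses brute-force cosine similarity in place of the efficient approximate nearest-neighbor search described above, and replaces the paper's graph traversal with a simple greedy nearest-neighbor walk; the function name `build_related_contexts` and all details are illustrative, not the authors' exact algorithm.

```python
import numpy as np

def build_related_contexts(doc_embeddings, context_size):
    """Greedy sketch: start from an arbitrary document, repeatedly hop to the
    most similar unvisited document, then cut the resulting ordering into
    fixed-size input contexts. Every document is used exactly once.

    NOTE: simplified stand-in for the paper's approximate nearest-neighbor
    search plus graph traversal, not the authors' actual method."""
    n = len(doc_embeddings)
    # Normalize embeddings so dot products equal cosine similarities.
    emb = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)

    unvisited = set(range(n))
    current = unvisited.pop()   # arbitrary starting document
    order = [current]
    while unvisited:
        candidates = np.fromiter(unvisited, dtype=int)
        sims = emb[candidates] @ emb[current]
        current = int(candidates[np.argmax(sims)])  # most similar unvisited doc
        unvisited.remove(current)
        order.append(current)

    # Neighboring positions in `order` are similar, so each chunk forms a
    # coherent input context of related documents.
    return [order[i:i + context_size] for i in range(0, n, context_size)]

if __name__ == "__main__":
    # Toy usage: 8 random "document embeddings", contexts of 4 documents each.
    rng = np.random.default_rng(0)
    contexts = build_related_contexts(rng.normal(size=(8, 16)), context_size=4)
    print(contexts)
```

At corpus scale, the brute-force similarity above would be replaced by an approximate nearest-neighbor index, and the greedy walk by the traversal the abstract describes, which aims to maximize contextual similarity for every document without repeating any data.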