In-Context Pretraining: Language Modeling Beyond Document Boundaries

Abstract

Large language models (LMs) are currently trained to predict tokens given document prefixes, enabling them to directly perform long-form generation and prompting-style tasks that can be reduced to document completion. Existing pretraining pipelines train LMs by concatenating random sets of short documents to create input contexts, but the prior documents provide no signal for predicting the next document. We instead present In-Context Pretraining, a new approach where language models are pretrained on a sequence of related documents, thereby explicitly encouraging them to read and reason across document boundaries. We can do In-Context Pretraining by simply changing the document ordering so that each context contains related documents, and then directly applying existing pretraining pipelines. However, this document sorting problem is challenging. There are billions of documents, and we would like the sort to maximize contextual similarity for every document without repeating any data. To do this, we introduce approximate algorithms for finding related documents with efficient nearest neighbor search and for constructing coherent input contexts with a graph traversal algorithm. Our experiments show that In-Context Pretraining offers a simple and scalable approach to significantly enhance LMs' performance: we see notable improvements in tasks that require more complex contextual reasoning, including in-context learning (+8%), reading comprehension (+15%), faithfulness to previous contexts (+16%), long-context reasoning (+5%), and retrieval augmentation (+9%).
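
The abstract gives only the outline of the document-sorting step, so the following is a minimal sketch of the general idea and not the authors' algorithm. It assumes each document already has an embedding vector, uses brute-force cosine similarity in place of the approximate nearest neighbor search the abstract mentions, and uses a simple greedy walk as a stand-in for the graph traversal. The names `order_documents` and `pack_contexts` are hypothetical, chosen for illustration.

```python
import numpy as np


def order_documents(embeddings: np.ndarray, start: int = 0) -> list[int]:
    """Greedy walk: from the current document, move to the most similar
    unvisited document, so every document appears exactly once."""
    # Normalize rows so that dot products equal cosine similarities.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)

    n = unit.shape[0]
    visited = np.zeros(n, dtype=bool)
    visited[start] = True
    order = [start]
    current = start
    for _ in range(n - 1):
        # Brute-force similarity; an assumption made for brevity. At the
        # scale of billions of documents this would be an approximate index.
        sims = unit @ unit[current]
        sims[visited] = -np.inf  # never revisit a document (no repeated data)
        current = int(np.argmax(sims))
        visited[current] = True
        order.append(current)
    return order


def pack_contexts(order: list[int], doc_lengths: list[int], context_len: int) -> list[list[int]]:
    """Concatenate documents in traversal order into contexts of at most
    context_len tokens. A document longer than context_len gets its own
    context; truncation is left out of this sketch."""
    contexts: list[list[int]] = []
    current: list[int] = []
    used = 0
    for doc in order:
        length = doc_lengths[doc]
        if current and used + length > context_len:
            contexts.append(current)
            current, used = [], 0
        current.append(doc)
        used += length
    if current:
        contexts.append(current)
    return contexts


# Toy data: documents 0 and 2 point one way, documents 1 and 3 another.
emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]])
order = order_documents(emb)
print(order)                                   # [0, 2, 3, 1]
print(pack_contexts(order, [3, 3, 3, 3], 6))   # [[0, 2], [3, 1]]
```

In this toy run, related documents end up adjacent in the ordering and therefore share a context, which is the property the abstract asks of the sort. The greedy walk is quadratic in the number of documents, so it illustrates the objective only and would not scale as written.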
