In-Context Pretraining: Language Modeling Beyond Document Boundaries

Abstract

Large language models (LMs) are currently trained to predict tokens given document prefixes, enabling them to directly perform long-form generation and prompting-style tasks that can be reduced to document completion. Existing pretraining pipelines train LMs by concatenating random sets of short documents to create input contexts, but the prior documents provide no signal for predicting the next document. We instead present In-Context Pretraining, a new approach in which language models are pretrained on a sequence of related documents, thereby explicitly encouraging them to read and reason across document boundaries. In-Context Pretraining requires only changing the document ordering so that each context contains related documents, and then directly applying existing pretraining pipelines. However, this document sorting problem is challenging: there are billions of documents, and the sort should maximize contextual similarity for every document without repeating any data. To do this, we introduce approximate algorithms that find related documents with efficient nearest neighbor search and construct coherent input contexts with a graph traversal algorithm. Our experiments show that In-Context Pretraining offers a simple and scalable approach to significantly enhance LMs' performance: we see notable improvements in tasks that require more complex contextual reasoning, including in-context learning (+8%), reading comprehension (+15%), faithfulness to previous contexts (+16%), long-context reasoning (+5%), and retrieval augmentation (+9%).
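The two steps named in the abstract, finding related documents and then ordering them so that each document appears exactly once, can be illustrated with a small sketch. This is not the authors' implementation: the brute-force cosine search stands in for an efficient approximate nearest neighbor index, the greedy walk is only one possible graph traversal, and the names `knn_graph`, `greedy_order`, and `pack_contexts` are hypothetical. Contexts are packed by document count here, whereas a real pipeline would presumably pack by token budget.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_graph(embeddings, k):
    """Brute-force k-nearest-neighbor lists, most similar first.

    Assumption: at real scale this O(n^2) step would be replaced by an
    approximate nearest neighbor index.
    """
    graph = []
    for i, u in enumerate(embeddings):
        scored = [(cosine(u, v), j) for j, v in enumerate(embeddings) if j != i]
        scored.sort(key=lambda s: (-s[0], s[1]))  # ties broken by index
        graph.append([j for _, j in scored[:k]])
    return graph


def greedy_order(graph):
    """Visit every document exactly once, preferring similar unvisited neighbors."""
    n = len(graph)
    visited = [False] * n
    order = []
    next_start = 0
    current = None
    while len(order) < n:
        if current is None:
            # Dead end (or first step): restart from the lowest unvisited index.
            while visited[next_start]:
                next_start += 1
            current = next_start
        visited[current] = True
        order.append(current)
        # Move to the most similar neighbor that has not been used yet.
        current = next((j for j in graph[current] if not visited[j]), None)
    return order


def pack_contexts(order, docs_per_context):
    """Cut the ordering into consecutive groups (a stand-in for token packing)."""
    return [order[i:i + docs_per_context]
            for i in range(0, len(order), docs_per_context)]


# Toy data: documents 0 and 2 point one way, documents 1 and 3 the other.
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
order = greedy_order(knn_graph(embeddings, k=2))
contexts = pack_contexts(order, docs_per_context=2)
print(order)     # [0, 2, 3, 1]
print(contexts)  # [[0, 2], [3, 1]]
```

In the toy run, each context ends up holding the two documents that point in the same direction, and no document is repeated, which mirrors the constraint stated in the abstract.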
