Data Engineering for Scaling Language Models to 128K Context

Abstract

We study the continual pretraining recipe for scaling language models' context lengths to 128K, with a focus on data engineering. We hypothesize that long-context modeling, in particular the ability to utilize information at arbitrary input locations, is a capability that is mostly already acquired through large-scale pretraining, and that this capability can be readily extended to contexts substantially longer than those seen during training (e.g., 4K to 128K) through lightweight continual pretraining on an appropriate data mixture. We investigate the quantity and quality of the data for continual pretraining: (1) for quantity, we show that 500 million to 5 billion tokens are enough to enable the model to retrieve information anywhere within the 128K context; (2) for quality, our results equally emphasize domain balance and length upsampling. Concretely, we find that naively upsampling longer data on certain domains like books, a common practice in existing work, gives suboptimal performance, and that a balanced domain mixture is important. We demonstrate that continual pretraining of the full model on 1B-5B tokens of such data is an effective and affordable strategy for scaling the context length of language models to 128K. Our recipe outperforms strong open-source long-context models and closes the gap to frontier models like GPT-4 128K.
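The abstract stresses both domain balance and length upsampling. One way to picture that combination is sketched below: long documents are upsampled within each domain, and each domain is then renormalized to a fixed share of the mixture, so that long-document-heavy domains such as books do not come to dominate. This is a minimal illustration, not the authors' recipe: the function name `per_domain_length_upsampling`, the `long_threshold` and `long_boost` parameters, and the example domain shares are all hypothetical.

```python
from collections import defaultdict


def per_domain_length_upsampling(docs, domain_ratio,
                                 long_threshold=4096, long_boost=5.0):
    """Return one sampling probability per document.

    docs: list of (domain, num_tokens) pairs.
    domain_ratio: dict mapping domain -> target share of the mixture
        (assumed to sum to 1 and to cover every domain in docs).
    long_threshold, long_boost: hypothetical knobs for illustration,
        not values taken from the paper.
    """
    # Raw weight: long documents count long_boost times, short ones once.
    raw = [
        long_boost if n_tokens >= long_threshold else 1.0
        for _, n_tokens in docs
    ]

    # Sum raw weights per domain so each domain is renormalized separately.
    domain_total = defaultdict(float)
    for (domain, _), weight in zip(docs, raw):
        domain_total[domain] += weight

    # Rescale so each domain's total probability equals its target share.
    return [
        domain_ratio[domain] * weight / domain_total[domain]
        for (domain, _), weight in zip(docs, raw)
    ]
```

The returned probabilities can be used as the `weights` argument of `random.choices` to draw training documents. Because the renormalization is done per domain, raising `long_boost` shifts probability toward long documents inside each domain without changing any domain's overall share.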

