Adapting Large Language Models via Reading Comprehension

Abstract

We explore how continued pre-training on domain-specific corpora influences large language models, revealing that training on the raw corpora endows the model with domain knowledge but drastically hurts its prompting ability for question answering. Taking inspiration from human learning via reading comprehension, where practice after reading improves the ability to answer questions based on the learned knowledge, we propose a simple method for transforming raw corpora into reading comprehension texts. Each raw text is enriched with a series of tasks related to its content. Our method, highly scalable and applicable to any pre-training corpora, consistently enhances performance across various tasks in three different domains: biomedicine, finance, and law. Notably, our 7B language model achieves performance competitive with domain-specific models of much larger scale, such as BloombergGPT-50B. Furthermore, we demonstrate that domain-specific reading comprehension texts can improve the model's performance even on general benchmarks, showing the potential to develop a general model across even more domains. Our model, code, and data will be available at https://github.com/microsoft/LMOps.
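The idea of enriching each raw text with content-related tasks can be illustrated with a minimal sketch. The abstract does not say how the tasks are constructed, so everything below is an assumption made for illustration: the `Task` type, the two task builders (`title_task`, `completion_task`), and the `Question:`/`Answer:` layout are hypothetical and are not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Task:
    """A question/answer pair derived from a raw text (hypothetical type)."""
    question: str
    answer: str


def _non_empty_lines(raw_text: str) -> List[str]:
    return [ln.strip() for ln in raw_text.strip().splitlines() if ln.strip()]


def title_task(raw_text: str) -> Optional[Task]:
    """Assumed heuristic: treat the first line as a title and ask for it."""
    lines = _non_empty_lines(raw_text)
    if len(lines) < 2:
        return None  # no separate title line, so skip this task
    return Task("What is a suitable title for the text above?", lines[0])


def completion_task(raw_text: str) -> Optional[Task]:
    """Assumed heuristic: ask the reader to finish the last line of the text."""
    lines = _non_empty_lines(raw_text)
    if not lines:
        return None
    words = lines[-1].split()
    if len(words) < 4:
        return None  # too short to split into a prompt and a continuation
    half = len(words) // 2
    return Task(
        "Complete this passage: " + " ".join(words[:half]) + " ...",
        " ".join(words[half:]),
    )


def to_reading_comprehension(
    raw_text: str,
    builders: List[Callable[[str], Optional[Task]]],
) -> str:
    """Append every task that can be built to the unchanged raw text."""
    parts = [raw_text.strip()]
    for build in builders:
        task = build(raw_text)
        if task is not None:
            parts.append(f"Question: {task.question}\nAnswer: {task.answer}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    raw = (
        "Statins and cholesterol\n"
        "Statins lower LDL cholesterol by inhibiting an enzyme in the liver."
    )
    print(to_reading_comprehension(raw, [title_task, completion_task]))
```

In this sketch the raw text is kept verbatim and the tasks are appended after it, so the result can still be used as ordinary pre-training text while also exercising question answering on its content.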
