Adapting Large Language Models via Reading Comprehension
September 18, 2023
Authors: Daixuan Cheng, Shaohan Huang, Furu Wei
cs.AI
Abstract
We explore how continued pre-training on domain-specific corpora influences
large language models, revealing that training on the raw corpora endows the
model with domain knowledge, but drastically hurts its prompting ability for
question answering. Taking inspiration from human learning via reading
comprehension--practice after reading improves the ability to answer questions
based on the learned knowledge--we propose a simple method for transforming raw
corpora into reading comprehension texts. Each raw text is enriched with a
series of tasks related to its content. Our method, highly scalable and
applicable to any pre-training corpora, consistently enhances performance
across various tasks in three different domains: biomedicine, finance, and law.
Notably, our 7B language model achieves competitive performance with
domain-specific models of much larger scales, such as BloombergGPT-50B.
Furthermore, we demonstrate that domain-specific reading comprehension texts
can improve the model's performance even on general benchmarks, showing the
potential to develop a general model across even more domains. Our model, code,
and data will be available at https://github.com/microsoft/LMOps.
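The core transformation, enriching each raw passage with comprehension tasks grounded in its content, can be sketched as follows. This is a minimal illustration, not the authors' implementation; the task templates and the helper name `to_reading_comprehension` are assumptions for demonstration only.

```python
# Minimal sketch (not the paper's code) of converting a raw domain text into a
# reading-comprehension-style training example by appending content-grounded
# tasks. The task templates below are illustrative assumptions.

def to_reading_comprehension(raw_text: str) -> str:
    """Wrap a raw corpus passage with simple comprehension tasks."""
    # Derive a cloze prompt from the passage's first sentence.
    first_sentence = raw_text.split(".")[0].strip()
    cloze_prefix = first_sentence.rsplit(" ", 1)[0]
    tasks = [
        "Question: What is the main topic of the passage? "
        "Answer with a phrase from it.",
        "Summarize the passage in one sentence.",
        f"Fill in the blank: {cloze_prefix} ____.",
    ]
    # The enriched text is the passage followed by its tasks.
    return raw_text + "\n\n" + "\n".join(tasks)

example = to_reading_comprehension(
    "Aspirin inhibits cyclooxygenase enzymes. It is widely used to reduce fever."
)
print(example)
```

In practice the paper applies such transformations at corpus scale before continued pre-training; the point of the sketch is only that each output keeps the original passage and adds question-answering practice tied to it.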