Adapting Large Language Models via Reading Comprehension
September 18, 2023
Authors: Daixuan Cheng, Shaohan Huang, Furu Wei
cs.AI
Abstract
We explore how continued pre-training on domain-specific corpora influences
large language models, revealing that training on the raw corpora endows the
model with domain knowledge, but drastically hurts its prompting ability for
question answering. Taking inspiration from human learning via reading
comprehension--practice after reading improves the ability to answer questions
based on the learned knowledge--we propose a simple method for transforming raw
corpora into reading comprehension texts. Each raw text is enriched with a
series of tasks related to its content. Our method, highly scalable and
applicable to any pre-training corpora, consistently enhances performance
across various tasks in three different domains: biomedicine, finance, and law.
Notably, our 7B language model achieves competitive performance with
domain-specific models of much larger scales, such as BloombergGPT-50B.
Furthermore, we demonstrate that domain-specific reading comprehension texts
can improve the model's performance even on general benchmarks, showing the
potential to develop a general model across even more domains. Our model, code,
and data will be available at https://github.com/microsoft/LMOps.
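The core transformation, enriching each raw passage with comprehension tasks grounded in its content, can be sketched as follows. This is a minimal illustration, not the authors' implementation; the task templates and the helper name `to_reading_comprehension` are assumptions for demonstration only.

```python
# Minimal sketch (not the paper's code) of converting a raw domain text into a
# reading-comprehension-style training example by appending content-grounded
# tasks. The task templates below are illustrative assumptions.

def to_reading_comprehension(raw_text: str) -> str:
    """Wrap a raw corpus passage with simple comprehension tasks."""
    # Derive a cloze prompt from the passage's first sentence.
    first_sentence = raw_text.split(".")[0].strip()
    cloze_prefix = first_sentence.rsplit(" ", 1)[0]
    tasks = [
        "Question: What is the main topic of the passage? "
        "Answer with a phrase from it.",
        "Summarize the passage in one sentence.",
        f"Fill in the blank: {cloze_prefix} ____.",
    ]
    # The enriched text is the passage followed by its tasks.
    return raw_text + "\n\n" + "\n".join(tasks)

example = to_reading_comprehension(
    "Aspirin inhibits cyclooxygenase enzymes. It is widely used to reduce fever."
)
print(example)
```

In practice the paper applies such transformations at corpus scale before continued pre-training; the point of the sketch is only that each output keeps the original passage and adds question-answering practice tied to it.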