Adapting Large Language Models via Reading Comprehension
September 18, 2023
Authors: Daixuan Cheng, Shaohan Huang, Furu Wei
cs.AI
Abstract
We explore how continued pre-training on domain-specific corpora influences
large language models, revealing that training on the raw corpora endows the
model with domain knowledge, but drastically hurts its prompting ability for
question answering. Taking inspiration from human learning via reading
comprehension--practice after reading improves the ability to answer questions
based on the learned knowledge--we propose a simple method for transforming raw
corpora into reading comprehension texts. Each raw text is enriched with a
series of tasks related to its content. Our method, highly scalable and
applicable to any pre-training corpus, consistently enhances performance
across various tasks in three different domains: biomedicine, finance, and law.
Notably, our 7B language model achieves competitive performance with
domain-specific models of much larger scales, such as BloombergGPT-50B.
Furthermore, we demonstrate that domain-specific reading comprehension texts
can improve the model's performance even on general benchmarks, showing the
potential to develop a general model across even more domains. Our model, code,
and data will be available at https://github.com/microsoft/LMOps.
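
The abstract describes enriching each raw domain text with a series of content-related tasks so that continued pre-training resembles a reading-comprehension exercise. The following is a minimal sketch of that idea, not the authors' released implementation: the task templates, answer heuristics, and function names are illustrative assumptions, and the paper's actual pipeline (available in the repository above) mines richer tasks from each text.

```python
# Minimal sketch (assumed, not the official AdaptLLM code): turn a raw domain
# passage into a "reading comprehension" training example by appending simple,
# content-grounded tasks after the passage.
import random

# Hypothetical task templates: (instruction, answer-builder) pairs.
TASK_TEMPLATES = [
    ("Summarize the passage above in one sentence.",
     lambda text: text.split(". ")[0] + "."),          # naive proxy: first sentence
    ("List a few capitalized terms that appear in the passage above.",
     lambda text: ", ".join(sorted({w.strip(".,") for w in text.split() if w.istitle()})[:5])),
]

def to_reading_comprehension(raw_text: str, num_tasks: int = 2) -> str:
    """Append a few content-related question/answer tasks after the raw text,
    mimicking a reading-comprehension exercise used for continued pre-training."""
    parts = [raw_text.strip()]
    k = min(num_tasks, len(TASK_TEMPLATES))
    for instruction, build_answer in random.sample(TASK_TEMPLATES, k=k):
        parts.append(f"\nQuestion: {instruction}\nAnswer: {build_answer(raw_text)}")
    return "\n".join(parts)

if __name__ == "__main__":
    passage = ("Aspirin irreversibly inhibits cyclooxygenase. "
               "It is widely used to reduce fever and inflammation.")
    print(to_reading_comprehension(passage))
```

In this sketch, the transformed example keeps the original passage (preserving domain knowledge) while the appended question/answer pairs rehearse the prompting format, which is the mechanism the abstract credits for restoring question-answering ability after domain-adaptive pre-training.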