Adapting Large Language Models via Reading Comprehension

Abstract

We explore how continued pre-training on domain-specific corpora influences large language models, and find that training on the raw corpora endows the model with domain knowledge but drastically hurts its prompting ability for question answering. Taking inspiration from human learning via reading comprehension (practice after reading improves the ability to answer questions based on the learned knowledge), we propose a simple method for transforming raw corpora into reading comprehension texts: each raw text is enriched with a series of tasks related to its content. Our method is highly scalable and applicable to any pre-training corpus, and it consistently enhances performance across various tasks in three different domains: biomedicine, finance, and law. Notably, our 7B language model achieves performance competitive with domain-specific models of much larger scale, such as BloombergGPT-50B. Furthermore, we demonstrate that domain-specific reading comprehension texts can improve the model's performance even on general benchmarks, showing the potential to develop a general model across even more domains. Our model, code, and data will be available at https://github.com/microsoft/LMOps.
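
As an illustration of the core idea (a raw text followed by tasks about its content), the following is a minimal sketch and not the authors' pipeline. The abstract does not say how tasks are constructed, so the `Task` class, the `to_reading_comprehension` function, the "Question:/Answer:" layout, and the sample passage are all hypothetical and supplied by hand here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Task:
    """One comprehension exercise: a question and its reference answer.

    Hypothetical structure; the abstract does not define a task format.
    """
    question: str
    answer: str


def to_reading_comprehension(raw_text: str, tasks: list[Task]) -> str:
    """Append the tasks to the raw text so they form a single training document."""
    parts = [raw_text.strip()]
    for task in tasks:
        # Assumed layout: each task is rendered as a question/answer pair.
        parts.append(f"Question: {task.question}\nAnswer: {task.answer}")
    return "\n\n".join(parts)


# Illustrative passage and tasks, written by hand for this sketch.
raw = "Aspirin inhibits the enzyme cyclooxygenase, which reduces inflammation."
tasks = [
    Task("What enzyme does aspirin inhibit?", "Cyclooxygenase."),
    Task("Summarize the passage in a few words.", "Aspirin reduces inflammation."),
]
example = to_reading_comprehension(raw, tasks)
print(example)
```

In this sketch the tasks are given as input; a scalable version would need to derive them automatically from each raw text, which is the part the abstract attributes to the proposed method.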
