INDUS: Effective and Efficient Language Models for Scientific Applications

Abstract

Large language models (LLMs) trained on general-domain corpora have shown remarkable results on natural language processing (NLP) tasks. However, previous research has demonstrated that LLMs trained on domain-focused corpora perform better on specialized tasks. Inspired by this pivotal insight, we developed INDUS, a comprehensive suite of LLMs tailored for the Earth science, biology, physics, heliophysics, planetary sciences, and astrophysics domains and trained on curated scientific corpora drawn from diverse data sources. The suite includes: (1) an encoder model trained using domain-specific vocabulary and corpora to address natural language understanding tasks; (2) a contrastive-learning-based general text embedding model trained on a diverse set of datasets drawn from multiple sources to address information retrieval tasks; and (3) smaller versions of these models, created using knowledge distillation techniques, to address applications that have latency or resource constraints. We also created three new scientific benchmark datasets, namely CLIMATE-CHANGE-NER (entity recognition), NASA-QA (extractive QA), and NASA-IR (IR), to accelerate research in these multidisciplinary fields. Finally, we show that our models outperform both general-purpose encoders (RoBERTa) and existing domain-specific encoders (SciBERT) on these new tasks as well as on existing benchmark tasks in the domains of interest.

