GeoGalactica: A Scientific Large Language Model in Geoscience

Abstract

Large language models (LLMs) have achieved great success thanks to their general knowledge and their ability to solve a wide range of tasks in natural language processing (NLP). These capabilities have drawn attention to potential interdisciplinary applications in which artificial intelligence fosters scientific discovery in a specific domain (AI for Science, AI4S). Meanwhile, the use of NLP techniques in geoscience research and practice is broad and complex, ranging from knowledge extraction and document classification to question answering and knowledge discovery. In this work, we take an initial step toward leveraging LLMs for science through a rather straightforward approach. We specialize an LLM for geoscience by further pre-training the model on a vast amount of geoscience text and then applying supervised fine-tuning (SFT) to the resulting model with our custom-collected instruction-tuning dataset. These efforts result in GeoGalactica, a model with 30 billion parameters. To the best of our knowledge, it is the largest language model for the geoscience domain. More specifically, GeoGalactica is obtained by further pre-training Galactica. We train GeoGalactica on a geoscience-related text corpus of 65 billion tokens curated from extensive data sources in the big science project Deep-time Digital Earth (DDE), preserved as the largest geoscience-specific text corpus. We then fine-tune the model with 1 million pairs of instruction-tuning data consisting of questions that require professional geoscience knowledge to answer. In this technical report, we describe in detail all aspects of GeoGalactica, including data collection, data cleaning, base model selection, pre-training, SFT, and evaluation. We open-source our data curation tools and the checkpoints of GeoGalactica from the first three-quarters of pre-training.
