GeoGalactica: A Scientific Large Language Model in Geoscience
December 31, 2023
Authors: Zhouhan Lin, Cheng Deng, Le Zhou, Tianhang Zhang, Yi Xu, Yutong Xu, Zhongmou He, Yuanyuan Shi, Beiya Dai, Yunchong Song, Boyi Zeng, Qiyuan Chen, Tao Shi, Tianyu Huang, Yiwei Xu, Shu Wang, Luoyi Fu, Weinan Zhang, Junxian He, Chao Ma, Yunqiang Zhu, Xinbing Wang, Chenghu Zhou
cs.AI
Abstract
Large language models (LLMs) have achieved huge success thanks to their broad general knowledge and their ability to solve a wide spectrum of tasks in natural language processing (NLP). Owing to these impressive abilities, LLMs hold promise for interdisciplinary applications that use artificial intelligence to foster scientific discovery in specific domains (AI for science, AI4S). Meanwhile, the use of NLP techniques in geoscience research and practice is broad and complex, contributing to tasks ranging from knowledge extraction and document classification to question answering and knowledge discovery. In this work, we take an initial step toward leveraging LLMs for science through a rather straightforward approach: we specialize an LLM for geoscience by further pre-training it on a vast amount of geoscience text, and then applying supervised fine-tuning (SFT) to the resulting model with our custom-collected instruction-tuning dataset. These efforts yield GeoGalactica, a model with 30 billion parameters that is, to the best of our knowledge, the largest language model in the geoscience domain. More specifically, GeoGalactica results from the further pre-training of Galactica on a geoscience-related text corpus of 65 billion tokens, curated from the extensive data sources of the big-science project Deep-time Digital Earth (DDE); this remains the largest geoscience-specific text corpus to date. We then fine-tune the model on 1 million instruction-tuning pairs whose questions demand professional geoscience knowledge to answer. In this technical report, we describe all aspects of GeoGalactica in detail, including data collection, data cleaning, base model selection, pre-training, SFT, and evaluation. We open-source our data curation tools and the GeoGalactica checkpoints from the first 3/4 of pre-training.
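
To make the two-stage recipe in the abstract concrete, the following is a minimal sketch of continued pre-training of Galactica on a domain corpus with the Hugging Face Transformers Trainer. The smaller 1.3B checkpoint, the corpus file name, and all hyperparameters are illustrative assumptions for this sketch, not the authors' actual training setup, which the report itself details.

    from datasets import load_dataset
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    # Stage 1: continued (further) pre-training on a geoscience corpus.
    # The 1.3B checkpoint stands in for the 30B model used in the paper.
    model_name = "facebook/galactica-1.3b"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Galactica's tokenizer may not define a pad token; add one so batches
    # can be padded (an assumption of this sketch, not the authors' setup).
    if tokenizer.pad_token is None:
        tokenizer.add_special_tokens({"pad_token": "<pad>"})
        model.resize_token_embeddings(len(tokenizer))

    # "geoscience_corpus.txt" is a hypothetical stand-in for the 65B-token
    # DDE-derived corpus described in the abstract.
    corpus = load_dataset("text", data_files={"train": "geoscience_corpus.txt"})

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=2048)

    train_set = corpus["train"].map(
        tokenize, batched=True, remove_columns=["text"]
    )

    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir="geogalactica-pt",
            per_device_train_batch_size=1,
            gradient_accumulation_steps=64,
            learning_rate=1e-5,
            num_train_epochs=1,
        ),
        train_dataset=train_set,
        # mlm=False gives the standard causal (next-token) LM objective.
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()

    # Stage 2 (SFT) would reuse the same loop, replacing the raw corpus with
    # the 1M instruction-response pairs rendered into prompt + answer text.

In practice, a 30B-parameter model requires distributed training (e.g., model parallelism and mixed precision) rather than the single-process loop shown here; the sketch only illustrates the shape of the continued pre-training and SFT pipeline.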