GeoGalactica: A Scientific Large Language Model in Geoscience
December 31, 2023
作者: Zhouhan Lin, Cheng Deng, Le Zhou, Tianhang Zhang, Yi Xu, Yutong Xu, Zhongmou He, Yuanyuan Shi, Beiya Dai, Yunchong Song, Boyi Zeng, Qiyuan Chen, Tao Shi, Tianyu Huang, Yiwei Xu, Shu Wang, Luoyi Fu, Weinan Zhang, Junxian He, Chao Ma, Yunqiang Zhu, Xinbing Wang, Chenghu Zhou
cs.AI
Abstract
Large language models (LLMs) have achieved huge success owing to their general
knowledge and their ability to solve a wide spectrum of tasks in natural
language processing (NLP). These impressive abilities have shed light on
potential interdisciplinary applications that use artificial intelligence to
foster scientific discovery in specific domains (AI for science, AI4S).
Meanwhile, NLP techniques are used widely across geoscience research and
practice, contributing to tasks ranging from knowledge extraction and document
classification to question answering and knowledge discovery. In this work, we
take an initial step toward leveraging LLMs for science through a rather
straightforward approach: we specialize an LLM for geoscience by further
pre-training it on a vast amount of geoscience text, and then applying
supervised fine-tuning (SFT) to the resulting model with our custom-collected
instruction-tuning dataset. These efforts yield GeoGalactica, a model with 30
billion parameters. To the best of our knowledge, it is the largest language
model for the geoscience domain. More specifically, GeoGalactica is obtained
by further pre-training Galactica. We train GeoGalactica on a
geoscience-related text corpus of 65 billion tokens, curated from the
extensive data sources of the big-science project Deep-time Digital Earth
(DDE); it remains the largest geoscience-specific text corpus to date. We then
fine-tune the model on 1 million instruction-tuning pairs consisting of
questions that demand professional geoscience knowledge to answer. In this
technical report, we describe all aspects of GeoGalactica in detail, including
data collection, data cleaning, base model selection, pre-training, SFT, and
evaluation. We open-source our data-curation tools and the checkpoints of
GeoGalactica from the first 3/4 of pre-training.
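To make the two-stage recipe the abstract describes concrete, below is a
minimal sketch of continued pre-training followed by SFT, written with Hugging
Face transformers and datasets. The file names (geoscience_corpus.jsonl,
geo_instructions.jsonl), the prompt format, and all hyperparameters are
illustrative assumptions, not the pipeline actually used for GeoGalactica.

```python
# Sketch of the two-stage specialization described in the abstract:
# (1) continued causal-LM pre-training of Galactica on a geoscience corpus,
# (2) supervised fine-tuning (SFT) on instruction pairs.
# File names, prompt format, and hyperparameters are assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE = "facebook/galactica-30b"  # base checkpoint named in the abstract

tokenizer = AutoTokenizer.from_pretrained(BASE)
if tokenizer.pad_token is None:
    # Reuse EOS for padding if no pad token is set (an assumption about
    # this checkpoint's special tokens).
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE)

# With mlm=False the collator pads batches and derives labels from input_ids,
# i.e. plain next-token prediction.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

def run_stage(dataset, output_dir):
    # One causal-LM training pass over the given dataset.
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir=output_dir,
                               per_device_train_batch_size=1,
                               num_train_epochs=1),
        train_dataset=dataset,
        data_collator=collator,
    )
    trainer.train()

# Stage 1: continued pre-training on raw geoscience text (hypothetical file).
corpus = load_dataset("json", data_files="geoscience_corpus.jsonl")["train"]
run_stage(corpus.map(tokenize, batched=True,
                     remove_columns=corpus.column_names), "ckpt-pretrain")

# Stage 2: SFT; serialize each instruction pair into a single training string.
def to_text(batch):
    return {"text": [f"Question: {q}\n\nAnswer: {a}"
                     for q, a in zip(batch["question"], batch["answer"])]}

sft = load_dataset("json", data_files="geo_instructions.jsonl")["train"]
sft = sft.map(to_text, batched=True, remove_columns=sft.column_names)
run_stage(sft.map(tokenize, batched=True, remove_columns=["text"]),
          "ckpt-sft")
```

Note that training a 30B-parameter model additionally requires distributed
infrastructure (e.g., model parallelism or DeepSpeed ZeRO), which this
single-process sketch deliberately omits.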