GeoGalactica: A Scientific Large Language Model in Geoscience

Abstract

Large language models (LLMs) have achieved great success thanks to their general knowledge and their ability to solve a wide spectrum of tasks in natural language processing (NLP). These abilities point to potential interdisciplinary applications in which artificial intelligence fosters scientific discovery in a specific domain (AI for science, AI4S). Meanwhile, the use of NLP techniques in geoscience research and practice is broad and varied, ranging from knowledge extraction and document classification to question answering and knowledge discovery. In this work, we take an initial step toward leveraging LLMs for science through a rather straightforward approach. We specialize an LLM for geoscience by further pre-training it on a vast amount of geoscience text and then applying supervised fine-tuning (SFT) to the resulting model with our custom-collected instruction-tuning dataset. These efforts result in GeoGalactica, a model with 30 billion parameters. To the best of our knowledge, it is the largest language model for the geoscience domain. More specifically, GeoGalactica is obtained by further pre-training Galactica. We train GeoGalactica on a geoscience-related text corpus of 65 billion tokens curated from extensive data sources in the big science project Deep-time Digital Earth (DDE), which remains the largest geoscience-specific text corpus. We then fine-tune the model with 1 million pairs of instruction-tuning data consisting of questions that demand professional geoscience knowledge to answer. In this technical report, we describe in detail all aspects of GeoGalactica, including data collection, data cleaning, base model selection, pre-training, SFT, and evaluation. We open-source our data curation tools and the checkpoints of GeoGalactica from the first three quarters of pre-training.
