SciLitLLM: How to Adapt LLMs for Scientific Literature Understanding

Abstract

Scientific literature understanding is crucial for extracting targeted information and garnering insights, thereby significantly advancing scientific discovery. Despite the remarkable success of Large Language Models (LLMs), they face challenges in scientific literature understanding, primarily due to (1) a lack of scientific knowledge and (2) unfamiliarity with specialized scientific tasks. To develop an LLM specialized in scientific literature understanding, we propose a hybrid strategy that integrates continual pre-training (CPT) and supervised fine-tuning (SFT) to simultaneously infuse scientific domain knowledge and enhance instruction-following capabilities for domain-specific tasks. In this process, we identify two key challenges: (1) constructing high-quality CPT corpora, and (2) generating diverse SFT instructions. We address these challenges through a meticulous pipeline that includes PDF text extraction, correction of content parsing errors, quality filtering, and synthetic instruction creation. Applying this strategy, we present SciLitLLM, a suite of LLMs specialized in scientific literature understanding. These models demonstrate promising performance on scientific literature understanding benchmarks. Our contributions are threefold: (1) We present an effective framework that integrates CPT and SFT to adapt LLMs to scientific literature understanding, and that can also be easily adapted to other domains. (2) We propose an LLM-based synthesis method to generate diverse and high-quality scientific instructions, resulting in a new instruction set, SciLitIns, for supervised fine-tuning in less-represented scientific domains. (3) SciLitLLM achieves promising performance improvements on scientific literature understanding benchmarks.
