

SciLitLLM: How to Adapt LLMs for Scientific Literature Understanding

August 28, 2024
作者: Sihang Li, Jin Huang, Jiaxi Zhuang, Yaorui Shi, Xiaochen Cai, Mingjun Xu, Xiang Wang, Linfeng Zhang, Guolin Ke, Hengxing Cai
cs.AI

Abstract

Scientific literature understanding is crucial for extracting targeted information and garnering insights, thereby significantly advancing scientific discovery. Despite the remarkable success of Large Language Models (LLMs), they face challenges in scientific literature understanding, primarily due to (1) a lack of scientific knowledge and (2) unfamiliarity with specialized scientific tasks. To develop an LLM specialized in scientific literature understanding, we propose a hybrid strategy that integrates continual pre-training (CPT) and supervised fine-tuning (SFT) to simultaneously infuse scientific domain knowledge and enhance instruction-following capabilities for domain-specific tasks. In this process, we identify two key challenges: (1) constructing high-quality CPT corpora, and (2) generating diverse SFT instructions. We address these challenges through a meticulous pipeline, including PDF text extraction, parsing content error correction, quality filtering, and synthetic instruction creation. Applying this strategy, we present a suite of LLMs: SciLitLLM, specialized in scientific literature understanding. These models demonstrate promising performance on scientific literature understanding benchmarks. Our contributions are threefold: (1) We present an effective framework that integrates CPT and SFT to adapt LLMs to scientific literature understanding, which can also be easily adapted to other domains. (2) We propose an LLM-based synthesis method to generate diverse and high-quality scientific instructions, resulting in a new instruction set -- SciLitIns -- for supervised fine-tuning in less-represented scientific domains. (3) SciLitLLM achieves promising performance improvements on scientific literature understanding benchmarks.
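
To make the curation pipeline concrete, below is a minimal, hypothetical Python sketch of the four stages the abstract names: PDF text extraction, parsing error correction, quality filtering, and synthetic instruction creation. The `llm` helper, the prompts, and the heuristic filter are illustrative assumptions rather than the authors' implementation, and `pypdf` merely stands in for whatever extractor was actually used.

```python
# Hypothetical sketch of the four-stage corpus/instruction pipeline
# described in the abstract -- not the authors' code.
from pypdf import PdfReader  # assumed extractor; any PDF parser works


def llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to any LLM backend."""
    raise NotImplementedError("wire up an LLM client of your choice")


def extract_text(pdf_path: str) -> str:
    # Stage 1: PDF text extraction.
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def correct_parsing_errors(raw: str) -> str:
    # Stage 2: repair OCR/parsing artifacts with an LLM pass.
    return llm(
        "Fix formatting and parsing errors in the following text; "
        f"change nothing else:\n\n{raw}"
    )


def passes_quality_filter(text: str, min_chars: int = 500) -> bool:
    # Stage 3: cheap heuristic stand-in for a learned quality classifier.
    alpha_ratio = sum(c.isalpha() for c in text) / max(len(text), 1)
    return len(text) >= min_chars and alpha_ratio > 0.6


def synthesize_instructions(passage: str, n: int = 3) -> list[str]:
    # Stage 4: generate diverse, domain-grounded SFT instructions.
    out = llm(
        f"Write {n} distinct scientific-literature QA instructions "
        f"answerable from this passage:\n\n{passage}"
    )
    return [line.strip() for line in out.splitlines() if line.strip()]


def build_sft_records(pdf_paths: list[str]) -> list[str]:
    # End-to-end: clean each document, keep only high-quality text,
    # then synthesize instructions from the surviving passages.
    records = []
    for path in pdf_paths:
        text = correct_parsing_errors(extract_text(path))
        if passes_quality_filter(text):
            records.extend(synthesize_instructions(text))
    return records
```

In the paper's framing, the cleaned, filtered text feeds the CPT corpus, while the synthesized instructions (SciLitIns) drive SFT; the sketch above only traces that data flow at a high level.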

