SciLitLLM: How to Adapt LLMs for Scientific Literature Understanding

Abstract

Scientific literature understanding is crucial for extracting targeted information and gaining insights, and it thereby advances scientific discovery. Despite the remarkable success of Large Language Models (LLMs), they struggle with scientific literature understanding, primarily because of (1) a lack of scientific knowledge and (2) unfamiliarity with specialized scientific tasks. To develop an LLM specialized in scientific literature understanding, we propose a hybrid strategy that integrates continual pre-training (CPT) and supervised fine-tuning (SFT), so that the model both acquires scientific domain knowledge and follows instructions for domain-specific tasks more reliably. In this process, we identify two key challenges: (1) constructing high-quality CPT corpora, and (2) generating diverse SFT instructions. We address these challenges through a meticulous pipeline that includes PDF text extraction, correction of parsing errors in the extracted content, quality filtering, and synthetic instruction creation. Applying this strategy, we present SciLitLLM, a suite of LLMs specialized in scientific literature understanding. These models demonstrate promising performance on scientific literature understanding benchmarks. Our contributions are threefold: (1) We present an effective framework that integrates CPT and SFT to adapt LLMs to scientific literature understanding, and that can also be easily adapted to other domains. (2) We propose an LLM-based synthesis method to generate diverse and high-quality scientific instructions, resulting in a new instruction set, SciLitIns, for supervised fine-tuning in less-represented scientific domains. (3) SciLitLLM achieves promising performance improvements on scientific literature understanding benchmarks.
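
The corpus-construction stages named in the abstract (text extraction, parsing-error correction, quality filtering) can be pictured as a simple chain of functions. The sketch below is only an illustration of that ordering, not the authors' implementation: the function names `fix_parsing_errors`, `quality_score`, and `build_cpt_corpus`, the regex-based cleanup, the alphabetic-token heuristic, and the `min_score` threshold are all assumptions made for this example, and PDF extraction is assumed to have already produced plain-text pages.

```python
import re
from typing import Iterable, List


def fix_parsing_errors(text: str) -> str:
    """Toy stand-in for error correction (hypothetical, not the paper's method).

    Rejoins words hyphenated across a line break and collapses whitespace.
    Note: this would also merge genuinely hyphenated words split at a line end.
    """
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    return re.sub(r"\s+", " ", text).strip()


def quality_score(text: str) -> float:
    """Toy stand-in for a quality filter (hypothetical heuristic).

    Returns the fraction of whitespace-separated tokens that are purely alphabetic.
    """
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(t.isalpha() for t in tokens) / len(tokens)


def build_cpt_corpus(raw_pages: Iterable[str], min_score: float = 0.5) -> List[str]:
    """Clean each extracted page, then keep only pages scoring at least min_score."""
    cleaned = (fix_parsing_errors(page) for page in raw_pages)
    return [page for page in cleaned if quality_score(page) >= min_score]


if __name__ == "__main__":
    pages = ["Scientific litera-\nture   understanding", "3.1 0.2 ## 44 |"]
    print(build_cpt_corpus(pages))
```

In this toy setup the first page is repaired and kept, while the second, which consists only of table-like residue, is dropped. A real pipeline would presumably replace both stand-ins with stronger components.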
