ScholarCopilot: Training Large Language Models for Academic Writing with Accurate Citations

Abstract

Academic writing requires both coherent text generation and precise citation of relevant literature. Although recent Retrieval-Augmented Generation (RAG) systems have significantly improved factual accuracy in general-purpose text generation, their capacity to support professional academic writing remains limited. In this work, we introduce ScholarCopilot, a unified framework designed to enhance existing large language models for generating professional academic articles with accurate and contextually relevant citations. ScholarCopilot dynamically determines when to retrieve scholarly references by generating a retrieval token [RET], and then uses that token's representation to look up relevant citations from a database. The retrieved references are fed into the model to augment the generation process. We jointly optimize the generation and citation tasks within a single framework to increase efficiency. Trained on 500K papers from arXiv, our model achieves a top-1 retrieval accuracy of 40.1% on our evaluation dataset, outperforming baselines such as E5-Mistral-7B-Instruct (15.0%) and BM25 (9.8%). On a dataset of 1,000 academic writing samples, ScholarCopilot scores 16.2/25 in generation quality (measured across relevance, coherence, academic rigor, completeness, and innovation), surpassing models with 10x more parameters such as Qwen-2.5-72B-Instruct (15.8/25). Human studies also show ScholarCopilot's superior performance in citation recall, writing efficiency, and overall user experience, confirming the effectiveness of our approach.
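The retrieval mechanism described above can be illustrated with a minimal sketch. It is not the authors' implementation: the decoding loop is replaced by a hand-written list of (token, hidden vector) pairs, the citation database is a toy dictionary of three-dimensional embeddings, and every name (`retrieve`, `generate_with_citations`, `top1_accuracy`, `DATABASE`) is hypothetical. The use of cosine similarity is also an assumption; the abstract does not state which similarity measure is used. The sketch only shows the idea that the representation at a [RET] position acts as a query against the database, and how a top-1 retrieval accuracy could be computed.

```python
import math

RET = "[RET]"  # retrieval token named in the abstract

# Toy citation database: key -> embedding (values are made up for illustration).
DATABASE = {
    "paper_a": [1.0, 0.0, 0.0],
    "paper_b": [0.0, 1.0, 0.0],
    "paper_c": [0.0, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec, database, k=1):
    """Return the k database keys most similar to the query vector."""
    ranked = sorted(database, key=lambda key: cosine(query_vec, database[key]),
                    reverse=True)
    return ranked[:k]


def generate_with_citations(steps, database):
    """Replace each [RET] step with the top-1 retrieved citation key.

    `steps` stands in for model decoding: a list of (token, hidden_vector)
    pairs, where hidden_vector is only read at [RET] positions. Returns the
    text and the list of retrieved keys (the references a real system would
    feed back into the model).
    """
    tokens, retrieved = [], []
    for token, hidden in steps:
        if token == RET:
            key = retrieve(hidden, database, k=1)[0]
            retrieved.append(key)
            tokens.append(f"[{key}]")
        else:
            tokens.append(token)
    return " ".join(tokens), retrieved


def top1_accuracy(queries, gold_keys, database):
    """Fraction of queries whose top-1 retrieved key matches the gold key."""
    hits = sum(retrieve(q, database, k=1)[0] == g
               for q, g in zip(queries, gold_keys))
    return hits / len(queries)


STEPS = [("Dense", None), ("retrieval", None), (RET, [0.1, 0.9, 0.2]),
         ("works", None)]
TEXT, RETRIEVED = generate_with_citations(STEPS, DATABASE)
```

In this toy run the [RET] vector is closest to `paper_b`, so that key is inserted at the citation position. A real system would obtain the hidden vector from the language model and search a much larger index.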
