ScholarCopilot: Training Large Language Models for Academic Writing with Accurate Citations
April 1, 2025
Authors: Yubo Wang, Xueguang Ma, Ping Nie, Huaye Zeng, Zhiheng Lyu, Yuxuan Zhang, Benjamin Schneider, Yi Lu, Xiang Yue, Wenhu Chen
cs.AI
Abstract
Academic writing requires both coherent text generation and precise citation
of relevant literature. Although recent Retrieval-Augmented Generation (RAG)
systems have significantly improved factual accuracy in general-purpose text
generation, their capacity to adequately support professional academic writing
remains limited. In this work, we introduce ScholarCopilot, a unified framework
designed to enhance existing large language models for generating professional
academic articles with accurate and contextually relevant citations.
ScholarCopilot dynamically determines when to retrieve scholarly references by
generating a retrieval token [RET], and then utilizes its representation to
look up relevant citations from a database. The retrieved references are fed
into the model to augment the generation process. We jointly optimize both the
generation and citation tasks within a single framework to increase efficiency.
Trained on 500K papers from arXiv, our model achieves a top-1 retrieval
accuracy of 40.1% on our evaluation dataset, outperforming baselines such as
E5-Mistral-7B-Instruct (15.0%) and BM25 (9.8%). On a dataset of 1,000 academic
writing samples, ScholarCopilot scores 16.2/25 in generation quality (measured
across relevance, coherence, academic rigor, completeness, and innovation),
surpassing models with 10x more parameters such as Qwen-2.5-72B-Instruct
(15.8/25). Human studies further show ScholarCopilot's superior performance in
citation recall, writing efficiency, and overall user experience, supporting
the effectiveness of our approach.
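The retrieval step described in the abstract (using the representation of a generated [RET] token to look up citations, then scoring top-1 accuracy) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the abstract does not specify the similarity function (cosine similarity is used below), and the names `retrieve_top_k`, `top1_accuracy`, `CITATION_INDEX`, and `QUERIES`, as well as the toy 3-dimensional vectors, are hypothetical stand-ins for model hidden states and a real citation database.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_top_k(ret_vector, citation_index, k=1):
    """Rank citation keys by similarity to a [RET] token representation."""
    ranked = sorted(
        citation_index.items(),
        key=lambda item: cosine(ret_vector, item[1]),
        reverse=True,
    )
    return [key for key, _ in ranked[:k]]


def top1_accuracy(queries, citation_index):
    """Fraction of queries whose best-ranked citation is the gold citation."""
    hits = sum(
        1
        for ret_vector, gold in queries
        if retrieve_top_k(ret_vector, citation_index, k=1)[0] == gold
    )
    return hits / len(queries)


# Hypothetical toy data: each citation is a small embedding vector.
CITATION_INDEX = {
    "paper_a": [1.0, 0.0, 0.0],
    "paper_b": [0.0, 1.0, 0.0],
    "paper_c": [0.0, 0.0, 1.0],
}

# Each query pairs a stand-in [RET] representation with its gold citation.
QUERIES = [
    ([0.9, 0.1, 0.0], "paper_a"),
    ([0.1, 0.8, 0.1], "paper_b"),
    ([0.6, 0.0, 0.4], "paper_c"),  # a miss: the nearest entry is paper_a
    ([0.0, 0.2, 0.9], "paper_c"),
]

if __name__ == "__main__":
    print(retrieve_top_k([0.9, 0.1, 0.0], CITATION_INDEX))
    print(top1_accuracy(QUERIES, CITATION_INDEX))
```

In a real system the query vector would come from the language model's hidden state at the [RET] position, and the index would hold embeddings of candidate references; the sketch only shows the lookup and the metric, not training or generation.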