ACL-Verbatim: Hallucination-Free Question Answering for Research

Abstract

Academic researchers need efficient and reliable methods for collecting high-quality information from trusted sources, but current tools for AI-assisted research still suffer from the tendency of Large Language Models (LLMs) to produce factually inaccurate or nonsensical output, commonly referred to as hallucinations. We apply the extractive question answering system VerbatimRAG to research papers in the ACL Anthology, mapping user queries directly to verbatim text spans in retrieved documents. We contribute a novel ground-truth dataset for the task of mapping user queries to relevant text spans in research papers, and use it to train and evaluate a variety of extractive models. The annotation is performed by NLP researchers on synthetic user queries, which are generated with a custom pipeline based on the ScIRGen methodology and paired with chunks of research papers retrieved by VerbatimRAG. On this benchmark, a 150M-parameter ModernBERT token classifier trained on silver supervision from our pipeline achieves the best word-level F1 (53.6), ahead of the strongest evaluated LLM extractor (48.7).
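
To make the word-level F1 figures above more concrete, the following is a minimal sketch of one common way to score an extracted span against a gold span: a bag-of-words overlap F1 in the style of extractive QA evaluation. The abstract does not specify the exact metric definition, so the lowercasing, whitespace tokenization, and the function name word_f1 are assumptions made for illustration, not the authors' evaluation code.

```python
from collections import Counter


def word_f1(predicted: str, gold: str) -> float:
    """Bag-of-words F1 between a predicted span and a gold span.

    Assumption: words are lowercased and split on whitespace; the
    benchmark's actual tokenization and matching rules may differ.
    """
    pred_words = predicted.lower().split()
    gold_words = gold.lower().split()

    # Two empty spans count as a perfect match; one empty span as a miss.
    if not pred_words or not gold_words:
        return float(pred_words == gold_words)

    # Multiset intersection counts each shared word at most as often
    # as it occurs in both spans.
    overlap = sum((Counter(pred_words) & Counter(gold_words)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_words)
    recall = overlap / len(gold_words)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # The prediction has 5 words and covers all 3 gold words:
    # precision = 3/5, recall = 3/3, so F1 = 0.75.
    print(word_f1("the model achieves 53.6 F1", "model achieves 53.6"))
```

Under this definition, an extractor that returns extra words around the gold span loses precision, while one that returns too little loses recall, so the score rewards spans that match the annotated text closely.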