ACL-Verbatim: Hallucination-Free Question Answering for Research

Abstract

Academic researchers need efficient and reliable methods for collecting high-quality information from trusted sources, but modern tools for AI-assisted research still suffer from the tendency of Large Language Models (LLMs) to produce factually inaccurate or nonsensical output, commonly referred to as hallucinations. We apply the extractive question answering system VerbatimRAG to research papers in the ACL Anthology, directly mapping user queries to verbatim text spans in retrieved documents. We contribute a novel ground truth dataset for the task of mapping user queries to relevant text spans in research papers, and use it to train and evaluate a variety of extractive models. Human annotation is performed by NLP researchers on synthetic user queries, generated with a custom pipeline that follows the ScIRGen methodology, paired with chunks of research papers retrieved by VerbatimRAG. On this benchmark, a 150M-parameter ModernBERT token classifier trained on silver supervision from our pipeline achieves the best word-level F1 (53.6), ahead of the strongest evaluated LLM extractor (48.7).
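
To make the reported metric concrete, the following is a minimal sketch of one common way to compute a word-level F1 between an extracted span and a gold span. It assumes a SQuAD-style bag-of-words overlap with lowercasing and whitespace tokenization; the abstract does not specify the exact definition, so the actual evaluation may differ (for example, by scoring word positions within a chunk). The function name `word_f1` is hypothetical.

```python
from collections import Counter


def word_f1(predicted: str, gold: str) -> float:
    """Word-level F1 between a predicted span and a gold span.

    Assumption: bag-of-words overlap on lowercased, whitespace-split
    tokens. This is an illustrative definition, not necessarily the
    one used in the paper's evaluation.
    """
    pred_words = predicted.lower().split()
    gold_words = gold.lower().split()

    # Both empty: the extractor correctly returned nothing.
    if not pred_words and not gold_words:
        return 1.0
    # Exactly one side empty: no overlap is possible.
    if not pred_words or not gold_words:
        return 0.0

    # Multiset intersection counts each shared word at most as often
    # as it appears in both spans.
    overlap = sum((Counter(pred_words) & Counter(gold_words)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_words)
    recall = overlap / len(gold_words)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    pred = "the model is trained"
    gold = "the model is trained on silver labels"
    print(round(word_f1(pred, gold), 3))
```

Under this definition, an extractor that returns only part of the gold span keeps full precision but loses recall, so the score rewards spans that cover the relevant text without padding it.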