ACL-Verbatim: Hallucination-Free Question Answering for Research

Abstract

Academic researchers need efficient and reliable methods for collecting high-quality information from trusted sources, but modern tools for AI-assisted research still suffer from the tendency of Large Language Models (LLMs) to produce factually inaccurate or nonsensical output, commonly referred to as hallucinations. We apply the extractive question answering system VerbatimRAG to research papers in the ACL Anthology, directly mapping user queries to verbatim text spans in retrieved documents. We contribute a novel ground truth dataset for the task of mapping user queries to relevant text spans in research papers, and use it to train and evaluate a variety of extractive models. Human annotation is performed by NLP researchers and is based on synthetic user queries generated using a custom pipeline based on the ScIRGen methodology, paired with chunks of research papers retrieved by VerbatimRAG. On this benchmark, a 150M-parameter ModernBERT token classifier trained on silver supervision from our pipeline achieves the best word-level F1 (53.6), ahead of the strongest evaluated LLM extractor (48.7).
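
The abstract reports word-level F1 but does not define it. As a minimal sketch of one plausible reading, the Python below treats each whitespace-delimited word of a retrieved chunk as either selected or not, and scores the predicted selection against the gold selection. The function names `words_in_spans` and `word_level_f1`, the character-offset span format, and the whitespace tokenization are all assumptions made for illustration, not the evaluation code used in the paper.

```python
def words_in_spans(text, spans):
    """Return indices of whitespace-delimited words that overlap any span.

    `spans` is a list of (start, end) character offsets, end-exclusive.
    This span format is an assumption made for this sketch.
    """
    covered = set()
    pos = 0
    for idx, word in enumerate(text.split()):
        start = text.index(word, pos)  # locate this word at or after the previous one
        end = start + len(word)
        pos = end
        if any(start < s_end and s_start < end for s_start, s_end in spans):
            covered.add(idx)
    return covered


def word_level_f1(text, predicted_spans, gold_spans):
    """Word-level F1 between predicted and gold spans over one chunk."""
    pred = words_in_spans(text, predicted_spans)
    gold = words_in_spans(text, gold_spans)
    if not pred and not gold:
        return 1.0  # nothing to extract and nothing extracted
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    chunk = "alpha beta gamma delta epsilon"
    gold = [(6, 22)]        # covers "beta gamma delta"
    predicted = [(11, 30)]  # covers "gamma delta epsilon"
    print(word_level_f1(chunk, predicted, gold))
```

In this toy case two of the three predicted words are also gold words and two of the three gold words are recovered, so precision and recall are both 2/3. A corpus-level score could average this per-chunk value or pool the word counts across chunks; the abstract does not say which aggregation is used.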