ACL-Verbatim: Hallucination-Free Question Answering for Research

Abstract

Academic researchers need efficient and reliable methods for collecting high-quality information from trusted sources, but modern tools for AI-assisted research still suffer from the tendency of Large Language Models (LLMs) to produce factually inaccurate or nonsensical output, commonly referred to as hallucinations. We apply the extractive question answering system VerbatimRAG to research papers in the ACL Anthology, directly mapping user queries to verbatim text spans in retrieved documents. We contribute a novel ground truth dataset for the task of mapping user queries to relevant text spans in research papers, and use it to train and evaluate a variety of extractive models. Human annotation is performed by NLP researchers on synthetic user queries, generated with a custom pipeline based on the ScIRGen methodology, paired with chunks of research papers retrieved by VerbatimRAG. On this benchmark, a 150M-parameter ModernBERT token classifier trained on silver supervision from our pipeline achieves the best word-level F1 (53.6), ahead of the strongest evaluated LLM extractor (48.7).
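To make the reported metric concrete, the following is a minimal sketch of one common way to compute a word-level F1 between gold and predicted text spans within a single retrieved chunk. The abstract does not spell out the exact scoring procedure, so everything here is an assumption for illustration: whitespace tokenization, character-offset spans, the overlap rule, and the function names `word_offsets`, `words_in_spans`, and `word_level_f1` are all hypothetical.

```python
import re
from typing import Iterable, List, Set, Tuple

# A span is a pair of character offsets (start, end), end exclusive.
Span = Tuple[int, int]


def word_offsets(text: str) -> List[Span]:
    """Character spans of whitespace-delimited words (assumed tokenization)."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def words_in_spans(text: str, spans: Iterable[Span]) -> Set[int]:
    """Indices of the words that overlap at least one of the given spans."""
    spans = list(spans)
    selected = set()
    for i, (w_start, w_end) in enumerate(word_offsets(text)):
        if any(w_start < s_end and s_start < w_end for s_start, s_end in spans):
            selected.add(i)
    return selected


def word_level_f1(text: str, gold: Iterable[Span], pred: Iterable[Span]) -> float:
    """Harmonic mean of word-level precision and recall, in the range [0, 1]."""
    gold_words = words_in_spans(text, gold)
    pred_words = words_in_spans(text, pred)
    if not gold_words and not pred_words:
        return 1.0  # assumed convention: nothing to extract, nothing extracted
    true_positives = len(gold_words & pred_words)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_words)
    recall = true_positives / len(gold_words)
    return 2 * precision * recall / (precision + recall)


# Toy example (not data from the paper).
chunk = "We train a ModernBERT token classifier on silver labels."
gold_text = "ModernBERT token classifier"
pred_text = "token classifier on silver"
gold_spans = [(chunk.index(gold_text), chunk.index(gold_text) + len(gold_text))]
pred_spans = [(chunk.index(pred_text), chunk.index(pred_text) + len(pred_text))]

score = word_level_f1(chunk, gold_spans, pred_spans)
print(f"word-level F1: {100 * score:.1f}")
```

In this toy case the prediction shares two of its four words with the three gold words, giving a precision of 1/2, a recall of 2/3, and an F1 of 4/7. Scores such as 53.6 and 48.7 would correspond to this value multiplied by 100 and aggregated over the benchmark. Whether that aggregation is a per-example average or a pooled count is not specified in the abstract.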