Detecting Corpus-Level Knowledge Inconsistencies in Wikipedia with Large Language Models

September 27, 2025
Authors: Sina J. Semnani, Jirayu Burapacheep, Arpandeep Khatua, Thanawan Atchariyachanvanit, Zheng Wang, Monica S. Lam
cs.AI

Abstract

Wikipedia is the largest open knowledge corpus, widely used worldwide and serving as a key resource for training large language models (LLMs) and retrieval-augmented generation (RAG) systems. Ensuring its accuracy is therefore critical. But how accurate is Wikipedia, and how can we improve it? We focus on inconsistencies, a specific type of factual inaccuracy, and introduce the task of corpus-level inconsistency detection. We present CLAIRE, an agentic system that combines LLM reasoning with retrieval to surface potentially inconsistent claims along with contextual evidence for human review. In a user study with experienced Wikipedia editors, 87.5% reported higher confidence when using CLAIRE, and participants identified 64.7% more inconsistencies in the same amount of time. Combining CLAIRE with human annotation, we contribute WIKICOLLIDE, the first benchmark of real Wikipedia inconsistencies. Using random sampling with CLAIRE-assisted analysis, we find that at least 3.3% of English Wikipedia facts contradict another fact, with inconsistencies propagating into 7.3% of FEVEROUS and 4.0% of AmbigQA examples. Benchmarking strong baselines on this dataset reveals substantial headroom: the best fully automated system achieves an AUROC of only 75.1%. Our results show that contradictions are a measurable component of Wikipedia and that LLM-based systems like CLAIRE can provide a practical tool to help editors improve knowledge consistency at scale.
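For readers unfamiliar with the AUROC figure quoted above, the following is a minimal sketch of how the metric can be computed from binary labels and detector scores. It uses the standard pairwise (Mann-Whitney) formulation and is not the paper's evaluation code. The function name `auroc` and the toy labels and scores are hypothetical, chosen only for illustration.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Return the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one.
    Ties between a positive and a negative count as half a win."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: 1 = claim is inconsistent, 0 = claim is consistent.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.7, 0.8, 0.1]
print(auroc(labels, scores))
```

An AUROC of 50% corresponds to random ranking and 100% to a perfect ranking, which is why a best score of 75.1% leaves substantial headroom. The quadratic loop is fine for small samples; a rank-based implementation would be preferable for large benchmarks.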