Detecting Corpus-Level Knowledge Inconsistencies in Wikipedia with Large Language Models

Abstract

Wikipedia is the largest open knowledge corpus, widely used worldwide and serving as a key resource for training large language models (LLMs) and retrieval-augmented generation (RAG) systems. Ensuring its accuracy is therefore critical. But how accurate is Wikipedia, and how can we improve it? We focus on inconsistencies, a specific type of factual inaccuracy, and introduce the task of corpus-level inconsistency detection. We present CLAIRE, an agentic system that combines LLM reasoning with retrieval to surface potentially inconsistent claims along with contextual evidence for human review. In a user study with experienced Wikipedia editors, 87.5% reported higher confidence when using CLAIRE, and participants identified 64.7% more inconsistencies in the same amount of time. Combining CLAIRE with human annotation, we contribute WIKICOLLIDE, the first benchmark of real Wikipedia inconsistencies. Using random sampling with CLAIRE-assisted analysis, we find that at least 3.3% of English Wikipedia facts contradict another fact, with inconsistencies propagating into 7.3% of FEVEROUS and 4.0% of AmbigQA examples. Benchmarking strong baselines on this dataset reveals substantial headroom: the best fully automated system achieves an AUROC of only 75.1%. Our results show that contradictions are a measurable component of Wikipedia and that LLM-based systems like CLAIRE can provide a practical tool to help editors improve knowledge consistency at scale.
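To make the AUROC figure above concrete, here is a minimal sketch of how that metric can be computed for a detector that assigns each claim an inconsistency score. The function name `auroc` and the toy labels and scores are illustrative assumptions; they are not taken from CLAIRE or WIKICOLLIDE.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive example
    (label 1, e.g. a truly inconsistent claim) receives a higher score
    than a randomly chosen negative example (label 0). Ties count half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = inconsistent claim, 0 = consistent claim.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.6]
print(f"AUROC = {auroc(labels, scores):.3f}")
```

A score of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the reported 75.1% should be read. The quadratic pairwise loop is chosen for clarity; a rank-based computation would be preferable for large evaluation sets.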
