LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond

Abstract

With the recent appearance of large language models (LLMs) in practical settings, methods that can effectively detect factual inconsistencies are crucial for reducing the propagation of misinformation and improving trust in model outputs. When testing on existing factual consistency benchmarks, we find that a few LLMs perform competitively with traditional non-LLM methods on classification benchmarks for factual inconsistency detection. However, a closer analysis reveals that most LLMs fail on more complex formulations of the task, and it exposes issues with existing evaluation benchmarks that affect evaluation precision. To address this, we propose a new protocol for creating inconsistency detection benchmarks and implement it in a 10-domain benchmark called SummEdits. This new benchmark is 20 times more cost-effective per sample than previous benchmarks and highly reproducible, as we estimate inter-annotator agreement at about 0.9. Most LLMs struggle on SummEdits, with performance close to random chance. The best-performing model, GPT-4, is still 8% below estimated human performance, highlighting the gaps in LLMs' ability to reason about facts and detect inconsistencies when they occur.