LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond
May 23, 2023
Authors: Philippe Laban, Wojciech Kryściński, Divyansh Agarwal, Alexander R. Fabbri, Caiming Xiong, Shafiq Joty, Chien-Sheng Wu
cs.AI
Abstract
With the recent appearance of LLMs in practical settings, having methods that
can effectively detect factual inconsistencies is crucial to reduce the
propagation of misinformation and improve trust in model outputs. When testing
on existing factual consistency benchmarks, we find that a few large language
models (LLMs) perform competitively on classification benchmarks for factual
inconsistency detection compared to traditional non-LLM methods. However, a
closer analysis reveals that most LLMs fail on more complex formulations of the
task and exposes issues with existing evaluation benchmarks, affecting
evaluation precision. To address this, we propose a new protocol for
inconsistency detection benchmark creation and implement it in a 10-domain
benchmark called SummEdits. This new benchmark is 20 times more cost-effective
per sample than previous benchmarks and highly reproducible, as we estimate
inter-annotator agreement at about 0.9. Most LLMs struggle on SummEdits, with
performance close to random chance. The best-performing model, GPT-4, is still
8% below estimated human performance, highlighting the gaps in LLMs' ability
to reason about facts and detect inconsistencies when they occur.
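To make the task shape concrete, below is a minimal sketch of how a binary factual-consistency benchmark in the spirit of SummEdits could be scored against an LLM classifier. The prompt wording, the sample fields (document, summary, label), and the classify_fn callable standing in for a model call are illustrative assumptions, not the paper's actual evaluation harness; balanced accuracy of 0.5 corresponds to the "random chance" level mentioned in the abstract.

```python
# Illustrative sketch only: scoring a binary consistency-detection benchmark.
# The prompt, sample schema, and classify_fn interface are assumptions.
from typing import Callable, Dict, List

from sklearn.metrics import balanced_accuracy_score

PROMPT = (
    "Document:\n{document}\n\n"
    "Summary:\n{summary}\n\n"
    "Is the summary factually consistent with the document? "
    "Answer 'consistent' or 'inconsistent'."
)


def evaluate(samples: List[Dict], classify_fn: Callable[[str], str]) -> float:
    """Return balanced accuracy for binary labels (1 = consistent, 0 = inconsistent)."""
    y_true, y_pred = [], []
    for sample in samples:
        answer = classify_fn(
            PROMPT.format(document=sample["document"], summary=sample["summary"])
        )
        # Check for "inconsistent" first, since "consistent" is a substring of it.
        y_pred.append(0 if "inconsistent" in answer.lower() else 1)
        y_true.append(sample["label"])
    # Balanced accuracy of ~0.5 means the model is at random chance on this task.
    return balanced_accuracy_score(y_true, y_pred)
```

Usage would simply pass a list of labeled (document, summary) pairs and a wrapper around whichever LLM is being tested; swapping the wrapper lets the same scorer compare models such as GPT-4 against non-LLM baselines.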