

LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond

May 23, 2023
作者: Philippe Laban, Wojciech Kryściński, Divyansh Agarwal, Alexander R. Fabbri, Caiming Xiong, Shafiq Joty, Chien-Sheng Wu
cs.AI

Abstract

With the recent appearance of LLMs in practical settings, having methods that can effectively detect factual inconsistencies is crucial to reduce the propagation of misinformation and improve trust in model outputs. When testing on existing factual consistency benchmarks, we find that a few large language models (LLMs) perform competitively on classification benchmarks for factual inconsistency detection compared to traditional non-LLM methods. However, a closer analysis reveals that most LLMs fail on more complex formulations of the task and exposes issues with existing evaluation benchmarks, affecting evaluation precision. To address this, we propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits. This new benchmark is 20 times more cost-effective per sample than previous benchmarks and highly reproducible, as we estimate inter-annotator agreement at about 0.9. Most LLMs struggle on SummEdits, with performance close to random chance. The best-performing model, GPT-4, is still 8% below estimated human performance, highlighting the gaps in LLMs' ability to reason about facts and detect inconsistencies when they occur.
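The abstract frames factual inconsistency detection as a binary classification task over (document, summary) pairs. The sketch below illustrates one way such an evaluation can be posed to an LLM; the prompt wording and the `call_llm` stub are assumptions for illustration only, not the SummEdits protocol or the paper's prompts.

```python
# Hypothetical sketch: binary factual-inconsistency classification with an LLM.
# The prompt text and call_llm stub are illustrative assumptions, not the
# paper's method; the stub is hard-coded so the example runs standalone.

def build_prompt(document: str, summary: str) -> str:
    return (
        "Decide whether the summary is factually consistent with the document.\n\n"
        f"Document:\n{document}\n\n"
        f"Summary:\n{summary}\n\n"
        "Answer with one word: consistent or inconsistent."
    )

def call_llm(prompt: str) -> str:
    # Placeholder for an actual model call (e.g., an API request).
    return "inconsistent"

def is_consistent(document: str, summary: str) -> bool:
    answer = call_llm(build_prompt(document, summary)).strip().lower()
    return answer.startswith("consistent")

if __name__ == "__main__":
    doc = "The meeting was moved from Tuesday to Thursday."
    summ = "The meeting now takes place on Friday."
    print("consistent" if is_consistent(doc, summ) else "inconsistent")
```

Benchmarks like SummEdits score such predicted labels against human annotations, which is where the reported gap between GPT-4 and estimated human performance comes from.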