

LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond

May 23, 2023
Authors: Philippe Laban, Wojciech Kryściński, Divyansh Agarwal, Alexander R. Fabbri, Caiming Xiong, Shafiq Joty, Chien-Sheng Wu
cs.AI

Abstract

With the recent appearance of LLMs in practical settings, having methods that can effectively detect factual inconsistencies is crucial to reduce the propagation of misinformation and improve trust in model outputs. When testing on existing factual consistency benchmarks, we find that a few large language models (LLMs) perform competitively on classification benchmarks for factual inconsistency detection compared to traditional non-LLM methods. However, a closer analysis reveals that most LLMs fail on more complex formulations of the task and exposes issues with existing evaluation benchmarks, affecting evaluation precision. To address this, we propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits. This new benchmark is 20 times more cost-effective per sample than previous benchmarks and highly reproducible, as we estimate inter-annotator agreement at about 0.9. Most LLMs struggle on SummEdits, with performance close to random chance. The best-performing model, GPT-4, is still 8% below estimated human performance, highlighting the gaps in LLMs' ability to reason about facts and detect inconsistencies when they occur.
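
The abstract reports inter-annotator agreement of about 0.9 for SummEdits without naming the agreement metric. As a rough illustration only, the sketch below computes Cohen's kappa for two annotators' binary consistent/inconsistent labels; the choice of kappa as the metric and the toy labels are assumptions for illustration, not details taken from the paper.

```python
# Hedged sketch: measuring inter-annotator agreement with Cohen's kappa.
# The metric choice and the toy labels are assumptions; the abstract only
# states that agreement on SummEdits is estimated at roughly 0.9.
from sklearn.metrics import cohen_kappa_score

# 1 = summary judged factually consistent, 0 = judged inconsistent
annotator_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
annotator_b = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```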