TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization
February 20, 2024
作者: Liyan Tang, Igor Shalyminov, Amy Wing-mei Wong, Jon Burnsky, Jake W. Vincent, Yu'an Yang, Siffi Singh, Song Feng, Hwanjun Song, Hang Su, Lijia Sun, Yi Zhang, Saab Mansour, Kathleen McKeown
cs.AI
Abstract
Single document news summarization has seen substantial progress on
faithfulness in recent years, driven by research on the evaluation of factual
consistency, or hallucinations. We ask whether these advances carry over to
other text summarization domains. We propose a new evaluation benchmark on
topic-focused dialogue summarization, generated by LLMs of varying sizes. We
provide binary sentence-level human annotations of the factual consistency of
these summaries along with detailed explanations of factually inconsistent
sentences. Our analysis shows that existing LLMs hallucinate significant
amounts of factual errors in the dialogue domain, regardless of the model's
size. On the other hand, when LLMs, including GPT-4, serve as binary factual
evaluators, they perform poorly and can be outperformed by prevailing
state-of-the-art specialized factuality evaluation metrics. Finally, we
conduct an analysis of hallucination types with a curated error taxonomy. We
find that there are diverse errors and error distributions in model-generated
summaries and that non-LLM based metrics can capture all error types better
than LLM-based evaluators.