

TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization

February 20, 2024
作者: Liyan Tang, Igor Shalyminov, Amy Wing-mei Wong, Jon Burnsky, Jake W. Vincent, Yu'an Yang, Siffi Singh, Song Feng, Hwanjun Song, Hang Su, Lijia Sun, Yi Zhang, Saab Mansour, Kathleen McKeown
cs.AI

Abstract

Single document news summarization has seen substantial progress on faithfulness in recent years, driven by research on the evaluation of factual consistency, or hallucinations. We ask whether these advances carry over to other text summarization domains. We propose a new evaluation benchmark on topic-focused dialogue summarization, generated by LLMs of varying sizes. We provide binary sentence-level human annotations of the factual consistency of these summaries along with detailed explanations of factually inconsistent sentences. Our analysis shows that existing LLMs hallucinate significant amounts of factual errors in the dialogue domain, regardless of the model's size. On the other hand, when LLMs, including GPT-4, serve as binary factual evaluators, they perform poorly and can be outperformed by prevailing state-of-the-art specialized factuality evaluation metrics. Finally, we conduct an analysis of hallucination types with a curated error taxonomy. We find that there are diverse errors and error distributions in model-generated summaries and that non-LLM-based metrics can capture all error types better than LLM-based evaluators.
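The abstract describes LLMs acting as binary, sentence-level factuality evaluators: each summary sentence is judged as consistent or inconsistent with the source dialogue. Below is a minimal sketch of what such an evaluation loop could look like; the prompt wording and the `query_llm` helper are illustrative assumptions, not the paper's actual evaluation protocol.

```python
# Minimal sketch of binary sentence-level factual-consistency evaluation.
# `query_llm` is a hypothetical stand-in for any chat-completion call
# (e.g., GPT-4 or a specialized evaluator wrapped behind the same interface).

from typing import Callable, List


def evaluate_summary(
    dialogue: str,
    summary_sentences: List[str],
    query_llm: Callable[[str], str],
) -> List[bool]:
    """Return one binary consistency judgment per summary sentence."""
    judgments = []
    for sentence in summary_sentences:
        prompt = (
            "Dialogue:\n"
            f"{dialogue}\n\n"
            "Summary sentence:\n"
            f"{sentence}\n\n"
            "Is the summary sentence factually consistent with the dialogue? "
            "Answer 'yes' or 'no'."
        )
        answer = query_llm(prompt).strip().lower()
        judgments.append(answer.startswith("yes"))
    return judgments
```

Benchmark-level scores would then be obtained by comparing these per-sentence judgments against the human annotations (e.g., balanced accuracy), which is how evaluators such as GPT-4 and non-LLM factuality metrics can be compared on the same footing.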
