TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization

Abstract

Single-document news summarization has seen substantial progress on faithfulness in recent years, driven by research on the evaluation of factual consistency, or hallucinations. We ask whether these advances carry over to other text summarization domains. We propose a new evaluation benchmark for topic-focused dialogue summarization, built on summaries generated by LLMs of varying sizes. We provide binary sentence-level human annotations of the factual consistency of these summaries, along with detailed explanations of the factually inconsistent sentences. Our analysis shows that existing LLMs produce a significant number of factual errors in the dialogue domain, regardless of model size. On the other hand, when LLMs, including GPT-4, serve as binary factual evaluators, they perform poorly and can be outperformed by prevailing state-of-the-art specialized factuality evaluation metrics. Finally, we analyze hallucination types using a curated error taxonomy. We find that model-generated summaries contain diverse errors with varying error distributions, and that non-LLM-based metrics capture all error types better than LLM-based evaluators.
