TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization

Abstract

Single-document news summarization has seen substantial progress on faithfulness in recent years, driven by research on evaluating factual consistency, or hallucinations. We ask whether these advances carry over to other text summarization domains. We propose a new evaluation benchmark on topic-focused dialogue summarization, with summaries generated by LLMs of varying sizes. We provide binary sentence-level human annotations of the factual consistency of these summaries, along with detailed explanations of factually inconsistent sentences. Our analysis shows that existing LLMs produce a significant number of factual errors in the dialogue domain, regardless of model size. On the other hand, when LLMs, including GPT-4, serve as binary factual evaluators, they perform poorly and can be outperformed by prevailing state-of-the-art specialized factuality evaluation metrics. Finally, we analyze hallucination types using a curated error taxonomy. We find that model-generated summaries contain diverse errors and error distributions, and that non-LLM-based metrics capture all error types better than LLM-based evaluators.
