

MultiHal: Multilingual Dataset for Knowledge-Graph Grounded Evaluation of LLM Hallucinations

May 20, 2025
Authors: Ernests Lavrinovics, Russa Biswas, Katja Hose, Johannes Bjerva
cs.AI

Abstract

Large Language Models (LLMs) have inherent limitations in faithfulness and factuality, commonly referred to as hallucinations. Several benchmarks provide a test bed for factuality evaluation, but they are built on English-centric datasets and rely on supplementary context such as web links or text passages while ignoring available structured factual resources. Knowledge Graphs (KGs) have been identified as a useful aid for hallucination mitigation because they represent facts about entities and their relations in a structured way with minimal linguistic overhead. We address the lack of KG paths and multilinguality in existing hallucination evaluation benchmarks and propose MultiHal, a KG-based multilingual, multihop benchmark framed for generative text evaluation. As part of our data collection pipeline, we mined 140k KG paths from open-domain KGs and pruned the noisy ones, curating a high-quality subset of 25.9k. Our baseline evaluation shows an absolute increase of approximately 0.12 to 0.36 points in semantic similarity score for KG-RAG over vanilla QA across multiple languages and multiple models, demonstrating the potential of KG integration. We anticipate that MultiHal will foster future research on graph-based hallucination mitigation and fact-checking tasks.
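
To make the contrast between vanilla QA and KG-RAG concrete, here is a minimal sketch of how a multihop KG path might be linearized and placed into a prompt. The abstract does not specify the prompt format, so the function names, the `->` rendering of triples, and the example path are all illustrative assumptions rather than the authors' pipeline or dataset contents.

```python
from typing import List, Tuple

# A KG hop as (subject, predicate, object). This layout is an assumption
# made for illustration, not the benchmark's actual schema.
Triple = Tuple[str, str, str]


def linearize_path(path: List[Triple]) -> str:
    """Render a multihop KG path as plain text, one hop per line."""
    return "\n".join(f"{s} -> {p} -> {o}" for s, p, o in path)


def vanilla_prompt(question: str) -> str:
    """Vanilla QA: the model sees only the question."""
    return f"Question: {question}\nAnswer:"


def kg_rag_prompt(question: str, path: List[Triple]) -> str:
    """KG-RAG: the linearized KG path is supplied as grounding context."""
    return (
        "Use the knowledge-graph facts below to answer.\n"
        f"Facts:\n{linearize_path(path)}\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical two-hop example, not taken from MultiHal.
path = [
    ("Marie Curie", "place of birth", "Warsaw"),
    ("Warsaw", "country", "Poland"),
]
question = "In which country was Marie Curie born?"

print(vanilla_prompt(question))
print(kg_rag_prompt(question, path))
```

In an evaluation like the one described, the answers generated from each prompt would then be compared against a reference answer with a semantic similarity metric; that scoring step is omitted here because the abstract does not name the model or metric used.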

