MultiHal: Multilingual Dataset for Knowledge-Graph Grounded Evaluation of LLM Hallucinations

May 20, 2025
Authors: Ernests Lavrinovics, Russa Biswas, Katja Hose, Johannes Bjerva
cs.AI

Abstract

Large Language Models (LLMs) have inherent limitations in faithfulness and factuality, commonly referred to as hallucinations. Several benchmarks provide a test bed for factuality evaluation on English-centric datasets, but they rely on supplementary context such as web links or text passages and ignore the available structured factual resources. Knowledge Graphs (KGs) have been identified as a useful aid for hallucination mitigation, as they provide a structured way to represent facts about entities and their relations with minimal linguistic overhead. We address the lack of KG paths and multilinguality for factual language modeling in existing hallucination evaluation benchmarks and propose MultiHal, a KG-based multilingual, multihop benchmark framed for generative text evaluation. As part of our data collection pipeline, we mined 140k KG paths from open-domain KGs and pruned the noisy ones, curating a high-quality subset of 25.9k. Our baseline evaluation shows an absolute increase of approximately 0.12 to 0.36 points in semantic similarity score for KG-RAG over vanilla QA across multiple languages and multiple models, demonstrating the potential of KG integration. We anticipate that MultiHal will foster future research on graph-based hallucination mitigation and fact-checking tasks.
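
The contrast between vanilla QA and KG-RAG can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not specify the prompt template, how KG paths are serialized, or how the semantic similarity score is computed. The sketch assumes a KG path is a list of (subject, predicate, object) triples, that it is linearized into plain text and prepended to the question, and that similarity is a cosine over embedding vectors. The function names and the example path are hypothetical.

```python
from math import sqrt
from typing import List, Optional, Tuple

# Hypothetical representation: a KG path is a list of (subject, predicate, object) triples.
Triple = Tuple[str, str, str]


def linearize_path(path: List[Triple]) -> str:
    """Serialize a multihop KG path into one line of text (assumed format)."""
    return " ; ".join(f"{s} -> {p} -> {o}" for s, p, o in path)


def build_prompt(question: str, path: Optional[List[Triple]] = None) -> str:
    """Build a vanilla QA prompt, or a KG-RAG prompt when a path is supplied."""
    if path is None:
        return f"Question: {question}\nAnswer:"
    return (
        f"Knowledge graph path: {linearize_path(path)}\n"
        f"Question: {question}\nAnswer:"
    )


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two embedding vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Illustrative two-hop path; not taken from the dataset.
example_path = [
    ("Marie Curie", "place of birth", "Warsaw"),
    ("Warsaw", "country", "Poland"),
]
question = "In which country was Marie Curie born?"

vanilla_prompt = build_prompt(question)
kg_rag_prompt = build_prompt(question, example_path)
```

Under these assumptions, each prompt would be sent to a model, both answers would be embedded along with the reference answer, and the reported gain would correspond to the difference between the two cosine scores.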
