
Benchmarking Large Language Models for Knowledge Graph Validation

February 11, 2026
Authors: Farzad Shami, Stefano Marchesin, Gianmaria Silvello
cs.AI

Abstract

Knowledge Graphs (KGs) store structured factual knowledge by linking entities through relationships and are crucial for many applications. These applications depend on the KG's factual accuracy, so verifying facts is essential, yet challenging. Expert manual verification is ideal but impractical at scale. Automated methods show promise but are not ready for real-world KGs. Large Language Models (LLMs) offer potential through their semantic understanding and knowledge access, yet their suitability and effectiveness for KG fact validation remain largely unexplored. In this paper, we introduce FactCheck, a benchmark designed to evaluate LLMs for KG fact validation across three key dimensions: (1) the LLMs' internal knowledge; (2) external evidence via Retrieval-Augmented Generation (RAG); and (3) aggregated knowledge employing a multi-model consensus strategy. We evaluated open-source and commercial LLMs on three diverse real-world KGs. FactCheck also includes a RAG dataset of over 2 million documents tailored for KG fact validation, along with an interactive exploration platform for analyzing verification decisions. The experimental analyses demonstrate that while LLMs yield promising results, they are not yet sufficiently stable and reliable for real-world KG validation scenarios. Integrating external evidence through RAG methods yields fluctuating performance, providing inconsistent improvements over more streamlined approaches at higher computational cost. Similarly, strategies based on multi-model consensus do not consistently outperform individual models, underscoring the lack of a one-size-fits-all solution. These findings further emphasize the need for a benchmark like FactCheck to systematically evaluate and drive progress on this difficult yet crucial task.
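To make the consensus dimension concrete, below is a minimal Python sketch of a multi-model majority vote over per-model truth verdicts for a single KG triple. The Model callable interface, the prompt wording, the verdict parsing, and the stub models are illustrative assumptions for this sketch, not the benchmark's actual implementation.

# Minimal sketch of multi-model consensus fact validation (illustrative
# assumption, not FactCheck's actual pipeline). A Model is any callable
# mapping a prompt string to the model's raw text answer.
from typing import Callable, Iterable, Tuple

Model = Callable[[str], str]
Triple = Tuple[str, str, str]  # (subject, relation, object)

def parse_verdict(raw: str) -> bool:
    # Treat answers beginning with "true"/"yes" as supporting the fact.
    return raw.strip().lower().startswith(("true", "yes"))

def validate_fact(triple: Triple, models: Iterable[Model]) -> bool:
    subj, rel, obj = triple
    prompt = (
        "Is the following statement factually correct? "
        f"Answer True or False.\nStatement: {subj} {rel} {obj}."
    )
    votes = [parse_verdict(m(prompt)) for m in models]
    # Majority vote across models; a tie counts as rejecting the fact.
    return sum(votes) > len(votes) / 2

if __name__ == "__main__":
    # Toy stubs standing in for real LLM API calls.
    stubs = [lambda p: "True", lambda p: "False", lambda p: "True, it holds."]
    print(validate_fact(("Padua", "locatedIn", "Italy"), stubs))  # True (2 of 3)

A RAG variant of this sketch would first retrieve supporting passages from the document collection and prepend them to the prompt before collecting votes, which is where the fluctuating performance and added computational cost reported above come in.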