
Towards Comprehensive Stage-wise Benchmarking of Large Language Models in Fact-Checking

January 6, 2026
Authors: Hongzhan Lin, Zixin Chen, Zhiqi Shen, Ziyang Luo, Zhen Ye, Jing Ma, Tat-Seng Chua, Guandong Xu
cs.AI

Abstract
Large Language Models (LLMs) are increasingly deployed in real-world fact-checking systems, yet existing evaluations focus predominantly on claim verification and overlook the broader fact-checking workflow, including claim extraction and evidence retrieval. This narrow focus prevents current benchmarks from revealing systematic reasoning failures, factual blind spots, and robustness limitations of modern LLMs. To bridge this gap, we present FactArena, a fully automated arena-style evaluation framework that conducts comprehensive, stage-wise benchmarking of LLMs across the complete fact-checking pipeline. FactArena integrates three key components: (i) an LLM-driven fact-checking process that standardizes claim decomposition, evidence retrieval via tool-augmented interactions, and justification-based verdict prediction; (ii) an arena-styled judgment mechanism guided by consolidated reference guidelines to ensure unbiased and consistent pairwise comparisons across heterogeneous judge agents; and (iii) an arena-driven claim-evolution module that adaptively generates more challenging and semantically controlled claims to probe LLMs' factual robustness beyond fixed seed data. Across 16 state-of-the-art LLMs spanning seven model families, FactArena produces stable and interpretable rankings. Our analyses further reveal significant discrepancies between static claim-verification accuracy and end-to-end fact-checking competence, highlighting the necessity of holistic evaluation. The proposed framework offers a scalable and trustworthy paradigm for diagnosing LLMs' factual reasoning, guiding future model development, and advancing the reliable deployment of LLMs in safety-critical fact-checking applications.
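The abstract describes an arena-styled judgment mechanism that ranks models from pairwise comparisons between judge agents. It does not specify how FactArena aggregates those comparisons into rankings; one common approach for arena-style evaluation is an Elo-style rating update, sketched generically below (the model names and the `k` factor are illustrative assumptions, not details from the paper):

```python
# Generic Elo-style aggregation of arena-style pairwise judgments.
# Illustrative only: the abstract does not state FactArena's actual
# ranking method, and the model names here are placeholders.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_elo(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Shift ratings toward the observed outcome of one pairwise judgment."""
    e_w = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - e_w)
    ratings[loser] -= k * (1.0 - e_w)

# Example: three hypothetical models and a few judge verdicts.
ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
verdicts = [("model_a", "model_b"), ("model_a", "model_c"),
            ("model_b", "model_c")]
for winner, loser in verdicts:
    update_elo(ratings, winner, loser)

ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)  # model_a ranks first after winning both its comparisons
```

A rating scheme like this converges to a stable ordering as pairwise judgments accumulate, which is consistent with the stable, interpretable rankings the abstract reports across 16 LLMs.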