

When Models Lie, We Learn: Multilingual Span-Level Hallucination Detection with PsiloQA

October 6, 2025
作者: Elisei Rykov, Kseniia Petrushina, Maksim Savkin, Valerii Olisov, Artem Vazhentsev, Kseniia Titova, Alexander Panchenko, Vasily Konovalov, Julia Belikova
cs.AI

Abstract

Hallucination detection remains a fundamental challenge for the safe and reliable deployment of large language models (LLMs), especially in applications requiring factual accuracy. Existing hallucination benchmarks often operate at the sequence level and are limited to English, lacking the fine-grained, multilingual supervision needed for a comprehensive evaluation. In this work, we introduce PsiloQA, a large-scale, multilingual dataset annotated with span-level hallucinations across 14 languages. PsiloQA is constructed through an automated three-stage pipeline: generating question-answer pairs from Wikipedia using GPT-4o, eliciting potentially hallucinated answers from diverse LLMs in a no-context setting, and automatically annotating hallucinated spans using GPT-4o by comparing against golden answers and retrieved context. We evaluate a wide range of hallucination detection methods -- including uncertainty quantification, LLM-based tagging, and fine-tuned encoder models -- and show that encoder-based models achieve the strongest performance across languages. Furthermore, PsiloQA demonstrates effective cross-lingual generalization and supports robust knowledge transfer to other benchmarks, all while being significantly more cost-efficient than human-annotated datasets. Our dataset and results advance the development of scalable, fine-grained hallucination detection in multilingual settings.
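The three-stage construction pipeline described above can be sketched as a simple composition of steps. The sketch below is illustrative only: the stub functions (`generate_qa`, `answer_no_context`, `annotate_spans`) stand in for the GPT-4o and diverse-LLM calls used in the actual pipeline, and the toy annotator is a hypothetical placeholder, not PsiloQA's real annotation logic.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    question: str
    gold_answer: str
    model_answer: str
    hallucinated_spans: list  # (start, end) character offsets into model_answer

def build_sample(passage, generate_qa, answer_no_context, annotate_spans):
    """Mimic the three-stage pipeline: QA generation from a passage,
    no-context answering, and span-level hallucination annotation."""
    question, gold = generate_qa(passage)               # stage 1: QA pair from Wikipedia text
    hypothesis = answer_no_context(question)            # stage 2: answer without the passage
    spans = annotate_spans(hypothesis, gold, passage)   # stage 3: mark unsupported spans
    return Sample(question, gold, hypothesis, spans)

# Toy stubs standing in for the LLM calls (hypothetical, for illustration):
passage = "Mount Everest is 8,849 metres tall."

def generate_qa(p):
    return ("How tall is Mount Everest?", "8,849 metres")

def answer_no_context(q):
    # A no-context model answer containing a factual slip (8,848 vs 8,849).
    return "Mount Everest is 8,848 metres tall."

def annotate_spans(hyp, gold, p):
    # Toy annotator: flag the height figure when it contradicts the gold answer.
    return [] if gold in hyp else [(17, 29)]

sample = build_sample(passage, generate_qa, answer_no_context, annotate_spans)
print(sample.model_answer[17:29])  # the flagged span: "8,848 metres"
```

In the real pipeline each stub is an LLM call, and stage 3 compares the hypothesis against both the golden answer and retrieved context before emitting character-offset spans.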
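Evaluating span-level (rather than sequence-level) predictions requires a metric over character offsets. A minimal sketch of one common choice, character-level intersection-over-union, is shown below; this is an illustrative metric, not necessarily the exact one used in the PsiloQA evaluation.

```python
def span_iou(pred, gold):
    """Character-level IoU between predicted and gold hallucination spans.

    pred and gold are lists of (start, end) half-open character offsets.
    Returns 1.0 when both are empty (a correct "no hallucination" call).
    """
    pred_chars = {i for s, e in pred for i in range(s, e)}
    gold_chars = {i for s, e in gold for i in range(s, e)}
    if not pred_chars and not gold_chars:
        return 1.0
    return len(pred_chars & gold_chars) / len(pred_chars | gold_chars)

# Example: predicted span [0, 5) vs gold span [3, 8) overlap on 2 of 8 characters.
print(span_iou([(0, 5)], [(3, 8)]))  # 0.25
```

A set-of-characters formulation handles partial overlaps and multiple disjoint spans uniformly, which matters when fine-grained annotations split a hallucination across several fragments.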