ChatPaper.ai


When Models Lie, We Learn: Multilingual Span-Level Hallucination Detection with PsiloQA

October 6, 2025
Authors: Elisei Rykov, Kseniia Petrushina, Maksim Savkin, Valerii Olisov, Artem Vazhentsev, Kseniia Titova, Alexander Panchenko, Vasily Konovalov, Julia Belikova
cs.AI

Abstract

Hallucination detection remains a fundamental challenge for the safe and reliable deployment of large language models (LLMs), especially in applications requiring factual accuracy. Existing hallucination benchmarks often operate at the sequence level and are limited to English, lacking the fine-grained, multilingual supervision needed for a comprehensive evaluation. In this work, we introduce PsiloQA, a large-scale, multilingual dataset annotated with span-level hallucinations across 14 languages. PsiloQA is constructed through an automated three-stage pipeline: generating question-answer pairs from Wikipedia using GPT-4o, eliciting potentially hallucinated answers from diverse LLMs in a no-context setting, and automatically annotating hallucinated spans using GPT-4o by comparing against golden answers and retrieved context. We evaluate a wide range of hallucination detection methods -- including uncertainty quantification, LLM-based tagging, and fine-tuned encoder models -- and show that encoder-based models achieve the strongest performance across languages. Furthermore, PsiloQA demonstrates effective cross-lingual generalization and supports robust knowledge transfer to other benchmarks, all while being significantly more cost-efficient than human-annotated datasets. Our dataset and results advance the development of scalable, fine-grained hallucination detection in multilingual settings.
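The third pipeline stage, annotating hallucinated spans by comparing a model's answer against a golden answer, can be illustrated with a toy sketch. The actual PsiloQA pipeline uses GPT-4o together with retrieved Wikipedia context for this step; the function below is only a simplified, hypothetical approximation that flags answer tokens unsupported by the golden answer using a sequence alignment.

```python
import difflib

def mark_hallucinated_spans(answer: str, gold: str) -> list[str]:
    """Toy span-level annotator (illustrative only, not the PsiloQA method).

    Aligns the answer's tokens against the golden answer's tokens and
    returns contiguous answer spans that the gold answer does not support.
    """
    ans_tokens = answer.split()
    gold_tokens = gold.split()
    matcher = difflib.SequenceMatcher(a=gold_tokens, b=ans_tokens)
    spans = []
    for op, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "replace" and "insert" opcodes cover answer tokens with no
        # counterpart in the gold answer -- candidate hallucination spans.
        if op in ("replace", "insert"):
            spans.append(" ".join(ans_tokens[j1:j2]))
    return spans

gold = "Paris is the capital of France"
answer = "Paris is the capital of France founded in 1820"
print(mark_hallucinated_spans(answer, gold))  # ['founded in 1820']
```

A lexical diff like this cannot judge paraphrases or facts verifiable only from retrieved context, which is why the dataset relies on an LLM annotator; the sketch merely makes the span-level output format concrete.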
PDF · October 17, 2025