How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild
February 18, 2025
Authors: Saad Obaid ul Islam, Anne Lauscher, Goran Glavaš
cs.AI
Abstract
In the age of misinformation, hallucination -- the tendency of Large Language Models (LLMs) to generate non-factual or unfaithful responses -- represents the main risk to their global utility. Despite LLMs becoming increasingly multilingual, the vast majority of research on detecting and quantifying LLM hallucination is (a) English-centric and (b) focused on machine translation (MT) and summarization, tasks that are less common "in the wild" than open information seeking. In contrast, we aim to quantify the extent of LLM hallucination across languages in knowledge-intensive long-form question answering. To this end, we train a multilingual hallucination detection model and conduct a large-scale study across 30 languages and 6 open-source LLM families. We start from an English hallucination detection dataset and rely on MT to generate (noisy) training data in other languages. We also manually annotate gold data for five high-resource languages; we then demonstrate, for these languages, that the estimates of hallucination rates are similar between silver (LLM-generated) and gold test sets, validating the use of silver data for estimating hallucination rates in other languages. For the final estimation of hallucination rates, we build a knowledge-intensive QA dataset for 30 languages with LLM-generated prompts and Wikipedia articles as references. We find that, while LLMs generate longer responses with more hallucinated tokens for higher-resource languages, there is no correlation between the length-normalized hallucination rates of languages and their digital representation. Further, we find that smaller LLMs exhibit higher hallucination rates than larger models.
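The central quantity in the abstract is a length-normalized hallucination rate per language (hallucinated tokens over total generated tokens), which is then related to how well-resourced each language is digitally. The snippet below is a minimal sketch of that computation, not the authors' released code: the `hallucination_rate` helper, the per-language counts, and the `resource_size` proxy (e.g., Wikipedia article counts) are illustrative assumptions, and token-level hallucination labels are assumed to come from a separate detection model.

```python
# Sketch: length-normalized hallucination rates and their rank correlation
# with a proxy for digital representation (illustrative data only).
from scipy.stats import spearmanr


def hallucination_rate(responses):
    """responses: list of (num_hallucinated_tokens, num_tokens) per response."""
    hallucinated = sum(h for h, _ in responses)
    total = sum(n for _, n in responses)
    return hallucinated / total if total else 0.0


# Hypothetical per-language statistics: (hallucinated tokens, total tokens).
per_language = {
    "en": [(12, 400), (3, 250)],
    "sw": [(9, 120), (7, 150)],
    "hi": [(10, 200), (5, 180)],
}
# Hypothetical digital-representation proxy, e.g. number of Wikipedia articles.
resource_size = {"en": 6_800_000, "sw": 80_000, "hi": 160_000}

langs = sorted(per_language)
rates = [hallucination_rate(per_language[lang]) for lang in langs]
sizes = [resource_size[lang] for lang in langs]

# Rank correlation between per-language rates and resource size.
rho, p_value = spearmanr(rates, sizes)
print(f"Spearman rho={rho:.2f}, p={p_value:.3f}")
```

Normalizing by response length matters here because, as the abstract notes, higher-resource languages elicit longer responses and therefore more hallucinated tokens in absolute terms; the rate removes that length effect before testing for a correlation.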