How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild

Abstract

In the age of misinformation, hallucination (the tendency of Large Language Models (LLMs) to generate non-factual or unfaithful responses) represents the main risk for their global utility. Although LLMs are becoming increasingly multilingual, the vast majority of research on detecting and quantifying LLM hallucination is (a) English-centric and (b) focused on machine translation (MT) and summarization, tasks that are less common "in the wild" than open information seeking. In contrast, we aim to quantify the extent of LLM hallucination across languages in knowledge-intensive long-form question answering. To this end, we train a multilingual hallucination detection model and conduct a large-scale study across 30 languages and 6 open-source LLM families. We start from an English hallucination detection dataset and rely on MT to generate (noisy) training data in other languages. We also manually annotate gold data for five high-resource languages; we then demonstrate, for these languages, that the estimates of hallucination rates are similar between silver (LLM-generated) and gold test sets, validating the use of silver data for estimating hallucination rates for other languages. For the final rate estimation, we build a knowledge-intensive QA dataset for 30 languages with LLM-generated prompts and Wikipedia articles as references. We find that, while LLMs generate longer responses with more hallucinated tokens for higher-resource languages, there is no correlation between the length-normalized hallucination rates of languages and their digital representation. Further, we find that smaller LLMs exhibit higher hallucination rates than larger models.
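The difference between raw hallucinated-token counts and a length-normalized rate can be illustrated with a minimal Python sketch. It assumes the rate is the number of tokens flagged as hallucinated divided by the total number of response tokens, pooled per language; the abstract does not spell out the exact formula, and the `Response` record and `hallucination_rate` function are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Response:
    """One LLM response with token-level labels (hypothetical format)."""
    language: str
    labels: list[int]  # 1 = token flagged as hallucinated, 0 = not flagged


def hallucination_rate(responses: list[Response]) -> dict[str, float]:
    """Return hallucinated tokens divided by total tokens, pooled per language.

    Assumption: pooling over all tokens of a language; other normalizations
    (for example, averaging per-response rates) are equally plausible.
    """
    hallucinated: dict[str, int] = {}
    total: dict[str, int] = {}
    for r in responses:
        hallucinated[r.language] = hallucinated.get(r.language, 0) + sum(r.labels)
        total[r.language] = total.get(r.language, 0) + len(r.labels)
    # Languages with no tokens at all are skipped to avoid division by zero.
    return {lang: hallucinated[lang] / total[lang] for lang in total if total[lang] > 0}


# Toy data: the "en" response is twice as long and has twice as many
# flagged tokens as the "sw" response, yet both share the same rate.
toy = [
    Response("en", [0, 1, 0, 0, 1, 0, 0, 0]),
    Response("sw", [0, 1, 0, 0]),
]
rates = hallucination_rate(toy)
```

In this toy input the longer response contains more hallucinated tokens in absolute terms, while the normalized rates coincide, which is the kind of distinction the finding above relies on.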
