How Language Model Hallucinations Can Snowball

Abstract

A major risk of using language models (LMs) in practical applications is their tendency to hallucinate incorrect statements. Hallucinations are often attributed to knowledge gaps in LMs, but we hypothesize that in some cases, when justifying previously generated hallucinations, LMs output false claims that they can separately recognize as incorrect. We construct three question-answering datasets where ChatGPT and GPT-4 often state an incorrect answer and offer an explanation with at least one incorrect claim. Crucially, we find that ChatGPT and GPT-4 can identify 67% and 87% of their own mistakes, respectively. We refer to this phenomenon as hallucination snowballing: an LM over-commits to early mistakes, leading to more mistakes that it otherwise would not make.
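The self-identification rate mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: the `Case` record, the `verify` callable (assumed to wrap a separate, fresh-context query to the model), and the toy data are all hypothetical and serve only to show the shape of the measurement, that is, the fraction of a model's own false claims that it rejects when each claim is presented on its own.

```python
from typing import Callable, Iterable, NamedTuple


class Case(NamedTuple):
    """One evaluated example (hypothetical structure, not from the paper)."""
    question: str         # question originally posed to the model
    answer_correct: bool  # whether the model's initial answer was right
    false_claim: str      # an incorrect claim taken from the model's explanation


def self_detection_rate(cases: Iterable[Case],
                        verify: Callable[[str], bool]) -> float:
    """Fraction of wrongly answered cases whose false claim the model rejects.

    `verify(claim)` is an assumed wrapper that shows the claim to the model
    in a separate context and returns True if the model accepts it as correct.
    """
    wrong = [c for c in cases if not c.answer_correct]
    if not wrong:
        return 0.0
    detected = sum(1 for c in wrong if not verify(c.false_claim))
    return detected / len(wrong)


def stub_verify(claim: str) -> bool:
    # Stand-in for a real model call: this toy "model" wrongly accepts
    # exactly one false claim and rejects every other one.
    return claim == "8 is divisible by 3"


# Toy data, invented purely for illustration.
cases = [
    Case("q1", False, "7 is divisible by 2"),
    Case("q2", False, "9 is divisible by 2"),
    Case("q3", False, "8 is divisible by 3"),
    Case("q4", False, "5 is divisible by 4"),
    Case("q5", True, ""),  # correct answers are excluded from the rate
]

rate = self_detection_rate(cases, stub_verify)
print(f"{rate:.2f}")  # 3 of the 4 false claims are rejected
```

In a real experiment, `verify` would be replaced by a call to the model under test, and the cases would come from the model's actual answers and explanations.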
