How Language Model Hallucinations Can Snowball

May 22, 2023
Authors: Muru Zhang, Ofir Press, William Merrill, Alisa Liu, Noah A. Smith
cs.AI

Abstract

A major risk of using language models in practical applications is their tendency to hallucinate incorrect statements. Hallucinations are often attributed to knowledge gaps in LMs, but we hypothesize that in some cases, when justifying previously generated hallucinations, LMs output false claims that they can separately recognize as incorrect. We construct three question-answering datasets where ChatGPT and GPT-4 often state an incorrect answer and offer an explanation with at least one incorrect claim. Crucially, we find that ChatGPT and GPT-4 can identify 67% and 87% of their own mistakes, respectively. We refer to this phenomenon as hallucination snowballing: an LM over-commits to early mistakes, leading to more mistakes that it otherwise would not make.
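To make the evaluation protocol concrete, below is a minimal sketch of the two-stage check the abstract describes, using the OpenAI Python SDK: the model first answers and justifies a yes/no question in one turn, and is then asked in a fresh context to verify a claim drawn from its own explanation. The SDK calls are standard, but the prompts, the example question, and the extracted claim are illustrative assumptions, not the paper's exact prompts or datasets.

```python
# Minimal sketch of the two-stage check described in the abstract, assuming
# the OpenAI Python SDK (>=1.0) and an API key in the environment.
# Prompts, the example question, and the extracted claim are illustrative,
# not the paper's exact prompts or data.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4"  # or "gpt-3.5-turbo" for the ChatGPT setting


def ask(prompt: str) -> str:
    """Send a single-turn prompt and return the model's reply text."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


# Stage 1: the model commits to an answer and justifies it in one turn.
question = "Is 10733 a prime number? Answer yes or no, then explain your reasoning."
answer_with_explanation = ask(question)

# Stage 2: in a fresh context, ask the same model to verify a claim taken
# from its own explanation. If the model now labels the claim as false, the
# earlier error is one it could recognize in isolation, i.e. a snowballed
# hallucination rather than a pure knowledge gap.
claim = "10733 is divisible by 3."  # illustrative claim extracted from stage 1
verification = ask(f"Is the following statement true or false? {claim}")

print(answer_with_explanation)
print(verification)
```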