From Noise to Narrative: Tracing the Origins of Hallucinations in Transformers
September 8, 2025
Authors: Praneet Suresh, Jack Stanley, Sonia Joseph, Luca Scimeca, Danilo Bzdok
cs.AI
Abstract
As generative AI systems become competent and democratized in science,
business, and government, deeper insight into their failure modes has become
an acute need. The occasional volatility in their behavior, such as the propensity
of transformer models to hallucinate, impedes trust and adoption of emerging AI
solutions in high-stakes areas. In the present work, we establish how and when
hallucinations arise in pre-trained transformer models through concept
representations captured by sparse autoencoders, under scenarios with
experimentally controlled uncertainty in the input space. Our systematic
experiments reveal that the number of semantic concepts used by the transformer
model grows as the input information becomes increasingly unstructured. In the
face of growing uncertainty in the input space, the transformer model becomes
prone to activate coherent yet input-insensitive semantic features, leading to
hallucinated output. At its extreme, for pure-noise inputs, we identify a wide
variety of robustly triggered and meaningful concepts in the intermediate
activations of pre-trained transformer models, whose functional integrity we
confirm through targeted steering. We also show that hallucinations in the
output of a transformer model can be reliably predicted from the concept
patterns embedded in transformer layer activations. This collection of insights
on transformer internal processing mechanics has immediate consequences for
aligning AI models with human values, AI safety, opening the attack surface for
potential adversarial attacks, and providing a basis for automatic
quantification of a model's hallucination risk.
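To make the methodology concrete, the sketch below (Python/PyTorch, not the authors' released code) illustrates the three measurements the abstract refers to: counting how many sparse-autoencoder (SAE) concepts fire on a layer's activations as the input is interpolated toward pure noise, steering the residual stream along a single concept's decoder direction, and pooling SAE features into a vector on which a hallucination classifier could be trained. The SAE architecture, dimensions, activation threshold, steering strength, and the random tensors standing in for real transformer activations are all illustrative assumptions.

```python
# Minimal sketch of SAE-based concept measurements on transformer activations.
# All names, shapes, and thresholds are illustrative; with an untrained SAE and
# random stand-in activations, the printed counts only demonstrate mechanics.

import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Toy SAE: overcomplete ReLU dictionary over a layer's activations."""

    def __init__(self, d_model: int = 768, d_dict: int = 8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model, bias=False)

    def encode(self, h: torch.Tensor) -> torch.Tensor:
        # Non-negative feature activations; after sparsity-penalized training,
        # most are near zero, so each unit can be read as a "concept".
        return torch.relu(self.encoder(h))

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return self.decoder(f)


def count_active_concepts(sae: SparseAutoencoder,
                          layer_acts: torch.Tensor,
                          threshold: float = 0.1) -> int:
    """Number of distinct SAE concepts firing above `threshold` anywhere in a
    batch of layer activations of shape (batch, seq, d_model)."""
    feats = sae.encode(layer_acts)                # (batch, seq, d_dict)
    flat = feats.reshape(-1, feats.shape[-1])     # (batch * seq, d_dict)
    fired = (flat > threshold).any(dim=0)         # per-concept indicator
    return int(fired.sum().item())


def steer(layer_acts: torch.Tensor, sae: SparseAutoencoder,
          concept_idx: int, alpha: float = 5.0) -> torch.Tensor:
    """Targeted steering: add a scaled copy of one concept's decoder direction
    to every position of the residual stream."""
    direction = sae.decoder.weight[:, concept_idx]   # (d_model,)
    return layer_acts + alpha * direction


def pooled_concept_vector(sae: SparseAutoencoder,
                          layer_acts: torch.Tensor) -> torch.Tensor:
    """Mean-pooled SAE feature vector; a simple classifier over such vectors
    could be trained to predict whether the output was hallucinated."""
    return sae.encode(layer_acts).mean(dim=(0, 1))   # (d_dict,)


if __name__ == "__main__":
    torch.manual_seed(0)
    sae = SparseAutoencoder()

    # Random tensors stand in for intermediate activations of a pre-trained
    # transformer; in practice they would come from a forward hook on a chosen
    # layer, with the SAE trained on activations from that same layer.
    clean_acts = torch.randn(1, 32, 768)
    for noise_frac in (0.0, 0.5, 1.0):            # 1.0 ~ a pure-noise input
        acts = (1.0 - noise_frac) * clean_acts \
               + noise_frac * torch.randn_like(clean_acts)
        n = count_active_concepts(sae, acts)
        print(f"noise fraction {noise_frac:.1f}: {n} concepts active")

    steered = steer(clean_acts, sae, concept_idx=123, alpha=5.0)
    probe_input = pooled_concept_vector(sae, clean_acts)
```

In the setup described above, the SAE would be trained with a sparsity penalty on activations collected from a specific layer of the pre-trained transformer, and the hallucination predictor would be fit on pooled concept vectors labeled by whether the corresponding model output was hallucinated.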