
From Noise to Narrative: Tracing the Origins of Hallucinations in Transformers

September 8, 2025
Authors: Praneet Suresh, Jack Stanley, Sonia Joseph, Luca Scimeca, Danilo Bzdok
cs.AI

Abstract

As generative AI systems become more capable and more widely deployed across science, business, and government, deeper insight into their failure modes has become an acute need. The occasional volatility in their behavior, such as the propensity of transformer models to hallucinate, impedes trust in and adoption of emerging AI solutions in high-stakes settings. In the present work, we establish how and when hallucinations arise in pre-trained transformer models, using concept representations captured by sparse autoencoders under scenarios with experimentally controlled uncertainty in the input space. Our systematic experiments reveal that the number of semantic concepts used by the transformer model grows as the input information becomes increasingly unstructured. In the face of growing uncertainty in the input space, the transformer model becomes prone to activating coherent yet input-insensitive semantic features, leading to hallucinated output. At the extreme, for pure-noise inputs, we identify a wide variety of robustly triggered and meaningful concepts in the intermediate activations of pre-trained transformer models, whose functional integrity we confirm through targeted steering. We also show that hallucinations in a transformer model's output can be reliably predicted from the concept patterns embedded in its layer activations. This collection of insights into transformer-internal processing mechanics has immediate consequences for aligning AI models with human values, for AI safety, for revealing the attack surface for potential adversarial attacks, and for providing a basis for automatic quantification of a model's hallucination risk.
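The abstract implies a concrete measurement pipeline: encode a layer's hidden states with a sparse autoencoder (SAE), track how many concept features fire as the input grows noisier, and read hallucination risk off the resulting concept pattern. The sketch below illustrates that pipeline in outline only; it is not the authors' code, and the SAE weights, activations, and hallucination probe are all random stand-ins that would, in practice, be trained on real model activations.

```python
# A minimal sketch (not the paper's released code) of the pipeline the
# abstract describes:
#   (1) encode one transformer layer's hidden states with a sparse
#       autoencoder (SAE) to obtain concept activations,
#   (2) track how many concepts fire as the input grows noisier,
#   (3) probe the pooled concept pattern for hallucination risk.
# All weights and activations below are random stand-ins, and the probe is
# hypothetical; real use requires a trained SAE and labeled data.
import torch
import torch.nn.functional as F

def sae_encode(hidden: torch.Tensor, W_enc: torch.Tensor,
               b_enc: torch.Tensor) -> torch.Tensor:
    """Map layer activations (seq, d_model) to sparse concept codes
    (seq, n_features) with a ReLU-style SAE encoder."""
    return F.relu(hidden @ W_enc + b_enc)

def mean_active_concepts(codes: torch.Tensor, threshold: float = 1.0) -> float:
    """Average number of concept features firing above threshold per token."""
    return float((codes > threshold).float().sum(dim=1).mean())

torch.manual_seed(0)
d_model, n_features, seq_len = 64, 512, 16
W_enc = torch.randn(d_model, n_features) / d_model ** 0.5  # stand-in encoder
b_enc = torch.zeros(n_features)

# Simulate progressively less structured inputs by adding Gaussian noise to
# a fixed "clean" activation pattern (a proxy for feeding noisier inputs
# through a real pre-trained transformer).
clean = torch.randn(seq_len, d_model)
for noise_level in (0.0, 0.5, 1.0, 2.0):
    hidden = clean + noise_level * torch.randn(seq_len, d_model)
    codes = sae_encode(hidden, W_enc, b_enc)
    print(f"noise={noise_level:.1f}  "
          f"active concepts/token={mean_active_concepts(codes):.1f}")

# Hypothetical hallucination probe: a linear read-out on the mean concept
# pattern, mirroring the abstract's finding that hallucinations can be
# predicted from concept patterns in layer activations. It would need
# labeled (activations, hallucinated) pairs to train.
probe = torch.nn.Linear(n_features, 1)
p_hallucinate = torch.sigmoid(probe(codes.mean(dim=0)))
```

In this toy setup the per-token concept count rises with the noise level because larger-magnitude activations push more SAE features past the threshold; the paper's claim concerns the analogous trend measured on a real pre-trained transformer.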