From Noise to Narrative: Tracing the Origins of Hallucinations in Transformers

Abstract

As generative AI systems become competent and democratized in science, business, and government, deeper insight into their failure modes is now an acute need. The occasional volatility in their behavior, such as the propensity of transformer models to hallucinate, impedes trust in and adoption of emerging AI solutions in high-stakes areas. In the present work, we establish how and when hallucinations arise in pre-trained transformer models through concept representations captured by sparse autoencoders, under scenarios with experimentally controlled uncertainty in the input space. Our systematic experiments reveal that the number of semantic concepts used by the transformer model grows as the input information becomes increasingly unstructured. In the face of growing uncertainty in the input space, the transformer model becomes prone to activating coherent yet input-insensitive semantic features, leading to hallucinated output. At the extreme, for pure-noise inputs, we identify a wide variety of robustly triggered and meaningful concepts in the intermediate activations of pre-trained transformer models, whose functional integrity we confirm through targeted steering. We also show that hallucinations in the output of a transformer model can be reliably predicted from the concept patterns embedded in transformer layer activations. These insights into the internal processing mechanics of transformers have immediate consequences for aligning AI models with human values, for AI safety, for the attack surface they open to potential adversarial attacks, and for providing a basis for automatic quantification of a model's hallucination risk.
