Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models

Abstract

Hallucinations in large language models are a widespread problem, yet the mechanisms that determine whether a model will hallucinate are poorly understood, which limits our ability to solve the problem. Using sparse autoencoders as an interpretability tool, we discover that a key part of these mechanisms is entity recognition, where the model detects whether an entity is one it can recall facts about. Sparse autoencoders uncover meaningful directions in the representation space that detect whether the model recognizes an entity, e.g., detecting that it does not know about an athlete or a movie. This suggests that models can have self-knowledge: internal representations about their own capabilities. These directions are causally relevant: they can steer the model to refuse to answer questions about known entities, or to hallucinate attributes of unknown entities when it would otherwise refuse. We demonstrate that, although the sparse autoencoders were trained on the base model, these directions have a causal effect on the chat model's refusal behavior, suggesting that chat finetuning has repurposed this existing mechanism. Furthermore, we provide an initial exploration of the mechanistic role of these directions in the model, finding that they disrupt the attention of downstream heads that typically move entity attributes to the final token.
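
To make the notion of "steering along a direction" concrete, the following is a minimal sketch, not the authors' implementation. It assumes that steering means adding a scaled unit vector to a residual-stream activation, and it uses a plain ReLU latent as a stand-in for a sparse autoencoder feature. All names (`steer`, `latent_activation`, `unknown_dir`) and all numbers are hypothetical and chosen only for illustration.

```python
import math


def normalize(v):
    """Return v scaled to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def latent_activation(residual, encoder_dir, bias=0.0):
    """Toy SAE latent: ReLU(encoder_dir . residual + bias).

    Assumption: a simple ReLU encoder is used here for illustration;
    a real sparse autoencoder may use a different activation.
    """
    return max(0.0, dot(encoder_dir, residual) + bias)


def steer(residual, direction, alpha):
    """Add alpha times the unit-normalized direction to the activation."""
    d = normalize(direction)
    return [x + alpha * di for x, di in zip(residual, d)]


# Hypothetical 3-dimensional residual activation and "unknown entity" direction.
residual = [0.5, -1.0, 2.0]
unknown_dir = [0.0, 3.0, 4.0]

steered = steer(residual, unknown_dir, alpha=5.0)
unit_dir = normalize(unknown_dir)

# Steering raises the projection onto the direction by alpha,
# which in turn raises the toy latent's activation.
before = latent_activation(residual, unit_dir)
after = latent_activation(steered, unit_dir)
print(before, after)
```

In this toy setting, a positive `alpha` pushes the activation toward the "unknown entity" direction and a negative `alpha` pushes it away. In a real model the same addition would be applied to the residual stream at a chosen layer and token position, and that choice is not specified here.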
