Towards eliciting latent knowledge from LLMs with mechanistic interpretability

Abstract

As language models become more powerful and sophisticated, it is crucial that they remain trustworthy and reliable. There is concerning preliminary evidence that models may attempt to deceive their operators or keep secrets from them. To explore how well current techniques can elicit such hidden knowledge, we train a Taboo model: a language model that describes a specific secret word without explicitly stating it. Importantly, the secret word is not presented to the model in its training data or prompt. We then investigate methods to uncover this secret. First, we evaluate non-interpretability (black-box) approaches. We then develop largely automated strategies based on mechanistic interpretability techniques, including logit lens and sparse autoencoders. Our evaluation shows that both approaches are effective at eliciting the secret word in our proof-of-concept setting. These findings highlight the potential of such approaches for eliciting hidden knowledge and suggest several avenues for future work, including testing and refining the methods on more complex model organisms. This work aims to be a step towards addressing the crucial problem of eliciting secret knowledge from language models, thereby contributing to their safe and reliable deployment.
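To make the logit lens idea concrete, here is a minimal sketch in Python with NumPy. It is not the authors' pipeline: the vocabulary, the unembedding matrix, and the per-layer hidden states are all synthetic assumptions chosen for illustration, and the names `layer_norm`, `logit_lens`, `VOCAB`, `UNEMBED`, and `HIDDEN` are hypothetical. The sketch only shows the core operation, which is to decode intermediate hidden states with the final normalization and unembedding so that a token the model never outputs can still surface in a per-layer distribution.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Parameter-free layer norm over the last axis (stand-in for a model's final norm)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def logit_lens(hidden_states, unembed):
    """Decode each layer's hidden state as if it were the final one.

    hidden_states: (n_layers, d_model) activations at one token position.
    unembed:       (d_model, vocab_size) unembedding matrix.
    Returns per-layer token probabilities of shape (n_layers, vocab_size).
    """
    logits = layer_norm(hidden_states) @ unembed
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerically stable softmax
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

# Toy setup (assumption): a 5-token vocabulary with orthogonal, zero-mean
# unembedding directions taken from the columns of an 8x8 Hadamard matrix.
VOCAB = ["the", "moon", "ship", "gold", "tree"]
h2 = np.array([[1.0, 1.0], [1.0, -1.0]])
hadamard = np.kron(np.kron(h2, h2), h2)      # shape (8, 8)
UNEMBED = hadamard[:, 1:1 + len(VOCAB)]      # shape (8, 5), skips the all-ones column

# Synthetic hidden states (assumption): across 4 layers the activation drifts
# from a distractor direction ("ship") towards the secret direction ("moon").
secret, distractor = VOCAB.index("moon"), VOCAB.index("ship")
mix = np.linspace(0.0, 1.0, 4)[:, None]      # shape (4, 1)
HIDDEN = (1 - mix) * UNEMBED[:, distractor] + mix * UNEMBED[:, secret]

PROBS = logit_lens(HIDDEN, UNEMBED)
TOP_PER_LAYER = [VOCAB[i] for i in PROBS.argmax(axis=-1)]
print(TOP_PER_LAYER)
```

In this toy, the top-ranked token shifts from the distractor in the early layers to the secret word in the later layers. With a real model, the hidden states would come from the model's intermediate layers and the unembedding from its output head; how the authors select layers and token positions is not specified in the abstract.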
