Towards eliciting latent knowledge from LLMs with mechanistic interpretability
May 20, 2025
作者: Bartosz Cywiński, Emil Ryd, Senthooran Rajamanoharan, Neel Nanda
cs.AI
Abstract
As language models become more powerful and sophisticated, it is crucial that
they remain trustworthy and reliable. There is concerning preliminary evidence
that models may attempt to deceive or keep secrets from their operators. To
explore the ability of current techniques to elicit such hidden knowledge, we
train a Taboo model: a language model that describes a specific secret word
without explicitly stating it. Importantly, the secret word is not presented to
the model in its training data or prompt. We then investigate methods to
uncover this secret. First, we evaluate non-interpretability (black-box)
approaches. Subsequently, we develop largely automated strategies based on
mechanistic interpretability techniques, including logit lens and sparse
autoencoders. Evaluation shows that both approaches are effective in eliciting
the secret word in our proof-of-concept setting. Our findings highlight the
promise of these approaches for eliciting hidden knowledge and suggest several
promising avenues for future work, including testing and refining these methods
on more complex model organisms. This work aims to be a step towards addressing
the crucial problem of eliciting secret knowledge from language models, thereby
contributing to their safe and reliable deployment.
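The abstract names the logit lens as one of the interpretability techniques used to surface the secret word. As a rough illustration of the general logit-lens idea only (not the paper's actual pipeline, prompts, or Taboo model, none of which are given here), the sketch below projects each layer's residual-stream activation of a stand-in GPT-2 model through the final layer norm and unembedding matrix, showing which token each intermediate layer would predict; the model name and prompt are placeholder assumptions.

```python
# Minimal logit-lens sketch (illustrative only; not the paper's implementation).
# Assumes a HuggingFace-style causal LM; "gpt2" and the prompt are stand-ins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper's Taboo model is not specified here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Give me a hint about your secret word."  # hypothetical elicitation prompt
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# For each layer, take the residual-stream activation at the final token position,
# apply the final layer norm, and project through the unembedding (lm_head)
# to obtain "early-exit" logits over the vocabulary.
final_ln = model.transformer.ln_f
unembed = model.lm_head
for layer, hidden in enumerate(outputs.hidden_states):
    logits = unembed(final_ln(hidden[:, -1, :]))
    top_id = logits.argmax(dim=-1).item()
    print(f"layer {layer}: top token = {tokenizer.decode(top_id)!r}")
```

In a secret-elicitation setting, the intuition is that tokens related to the hidden concept may become the top intermediate prediction at some layer even when the model never outputs them; how the paper aggregates and scores such candidates is described in the full text, not in this sketch.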