Towards eliciting latent knowledge from LLMs with mechanistic interpretability
May 20, 2025
作者: Bartosz Cywiński, Emil Ryd, Senthooran Rajamanoharan, Neel Nanda
cs.AI
Abstract
As language models become more powerful and sophisticated, it is crucial that
they remain trustworthy and reliable. There is concerning preliminary evidence
that models may attempt to deceive or keep secrets from their operators. To
explore the ability of current techniques to elicit such hidden knowledge, we
train a Taboo model: a language model that describes a specific secret word
without explicitly stating it. Importantly, the secret word is not presented to
the model in its training data or prompt. We then investigate methods to
uncover this secret. First, we evaluate non-interpretability (black-box)
approaches. Subsequently, we develop largely automated strategies based on
mechanistic interpretability techniques, including logit lens and sparse
autoencoders. Evaluation shows that both approaches are effective in eliciting
the secret word in our proof-of-concept setting. Our findings highlight the
promise of these approaches for eliciting hidden knowledge and suggest several
promising avenues for future work, including testing and refining these methods
on more complex model organisms. This work aims to be a step towards addressing
the crucial problem of eliciting secret knowledge from language models, thereby
contributing to their safe and reliable deployment.
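The abstract names the logit lens as one of the interpretability techniques used to surface the secret word. As a rough illustration of the general logit-lens idea only (not the paper's actual pipeline, prompts, or Taboo model, none of which are given here), the sketch below projects each layer's residual-stream activation of a stand-in GPT-2 model through the final layer norm and unembedding matrix, showing which token each intermediate layer would predict; the model name and prompt are placeholder assumptions.

```python
# Minimal logit-lens sketch (illustrative only; not the paper's implementation).
# Assumes a HuggingFace-style causal LM; "gpt2" and the prompt are stand-ins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper's Taboo model is not specified here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Give me a hint about your secret word."  # hypothetical elicitation prompt
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# For each layer, take the residual-stream activation at the final token position,
# apply the final layer norm, and project through the unembedding (lm_head)
# to obtain "early-exit" logits over the vocabulary.
final_ln = model.transformer.ln_f
unembed = model.lm_head
for layer, hidden in enumerate(outputs.hidden_states):
    logits = unembed(final_ln(hidden[:, -1, :]))
    top_id = logits.argmax(dim=-1).item()
    print(f"layer {layer}: top token = {tokenizer.decode(top_id)!r}")
```

In a secret-elicitation setting, the intuition is that tokens related to the hidden concept may become the top intermediate prediction at some layer even when the model never outputs them; how the paper aggregates and scores such candidates is described in the full text, not in this sketch.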