Towards eliciting latent knowledge from LLMs with mechanistic interpretability

May 20, 2025
Authors: Bartosz Cywiński, Emil Ryd, Senthooran Rajamanoharan, Neel Nanda
cs.AI

Abstract

As language models become more powerful and sophisticated, it is crucial that they remain trustworthy and reliable. There is concerning preliminary evidence that models may attempt to deceive or keep secrets from their operators. To explore the ability of current techniques to elicit such hidden knowledge, we train a Taboo model: a language model that describes a specific secret word without explicitly stating it. Importantly, the secret word is not presented to the model in its training data or prompt. We then investigate methods to uncover this secret. First, we evaluate non-interpretability (black-box) approaches. Subsequently, we develop largely automated strategies based on mechanistic interpretability techniques, including logit lens and sparse autoencoders. Evaluation shows that both approaches are effective in eliciting the secret word in our proof-of-concept setting. Our findings highlight the promise of these approaches for eliciting hidden knowledge and suggest several promising avenues for future work, including testing and refining these methods on more complex model organisms. This work aims to be a step towards addressing the crucial problem of eliciting secret knowledge from language models, thereby contributing to their safe and reliable deployment.
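The abstract does not include the paper's implementation, but a minimal sketch of the logit-lens step may help make the approach concrete. The snippet below is an assumption-laden illustration: it uses GPT-2 as a stand-in for the fine-tuned Taboo model, and the prompt is invented for illustration. It projects each layer's residual stream at the final token through the model's final layer norm and unembedding matrix, surfacing the top candidate tokens per layer; the idea is that a secret concept the model avoids saying may still appear among intermediate-layer predictions. The paper's actual pipeline is largely automated and also uses sparse autoencoders, which this sketch does not cover.

```python
# A minimal logit-lens sketch (PyTorch + Hugging Face transformers).
# ASSUMPTIONS: "gpt2" is a placeholder for the paper's fine-tuned Taboo
# model, and the prompt below is illustrative, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Give me a hint about your secret word."  # illustrative prompt
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Logit lens: project each layer's residual stream at the last token
# position through the final layer norm and the unembedding matrix,
# then read off the top candidate tokens for that layer.
for layer, hidden in enumerate(out.hidden_states):
    resid = hidden[0, -1]  # residual stream at the last token position
    logits = model.transformer.ln_f(resid) @ model.lm_head.weight.T
    top_ids = torch.topk(logits, k=5).indices
    print(f"layer {layer:2d}:", [tokenizer.decode(i) for i in top_ids])
```

Note that this is the standard logit-lens approximation: the final entry of `hidden_states` is already layer-normalized, so re-applying `ln_f` there is slightly redundant. A sparse-autoencoder variant would instead encode the residual stream with a trained SAE and rank its feature activations, which requires SAE weights not reproduced here.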
