從語言模型中提取隱含知識
Eliciting Secret Knowledge from Language Models
October 1, 2025
作者: Bartosz Cywiński, Emil Ryd, Rowan Wang, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy, Samuel Marks
cs.AI
摘要
我們研究秘密誘導:發現人工智慧擁有但未明確表達的知識。作為測試平台,我們訓練了三類大型語言模型(LLMs),使其具備特定知識並在下游任務中應用,但在被直接詢問時否認擁有這些知識。例如,在一種設定中,我們訓練一個LLM生成與知道用戶為女性一致的回复,但在被直接詢問時否認這一知識。接著,我們設計了多種黑盒和白盒秘密誘導技術,並根據它們是否能幫助LLM審計者成功猜測秘密知識來評估這些技術。我們的許多技術在簡單基線方法上有所改進。最有效的技術(在2/3的設定中表現最佳)基於預填充攻擊,這是一種黑盒技術,LLM在從預定義前綴生成補全時會揭示秘密知識。在剩下的設定中,基於logit透鏡和稀疏自編碼器(SAEs)的白盒技術最為有效。我們公開了模型和代碼,建立了評估秘密誘導方法的公共基準。
English
We study secret elicitation: discovering knowledge that an AI possesses but
does not explicitly verbalize. As a testbed, we train three families of large
language models (LLMs) to possess specific knowledge that they apply downstream
but deny knowing when asked directly. For example, in one setting, we train an
LLM to generate replies that are consistent with knowing the user is female,
while denying this knowledge when asked directly. We then design various
black-box and white-box secret elicitation techniques and evaluate them based
on whether they can help an LLM auditor successfully guess the secret
knowledge. Many of our techniques improve on simple baselines. Our most
effective techniques (performing best in 2/3 settings) are based on prefill
attacks, a black-box technique where the LLM reveals secret knowledge when
generating a completion from a predefined prefix. In our remaining setting,
white-box techniques based on logit lens and sparse autoencoders (SAEs) are
most effective. We release our models and code, establishing a public benchmark
for evaluating secret elicitation methods.