从语言模型中提取隐秘知识
Eliciting Secret Knowledge from Language Models
October 1, 2025
作者: Bartosz Cywiński, Emil Ryd, Rowan Wang, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy, Samuel Marks
cs.AI
摘要
我们研究秘密诱导:揭示人工智能拥有但未明确表达的知识。作为实验平台,我们训练了三个系列的大型语言模型(LLMs),使其具备特定知识并在下游任务中应用,但在被直接询问时却否认知晓。例如,在一种情境下,我们训练一个LLM生成与用户为女性这一知识相符的回复,而在被直接询问时却否认知晓。随后,我们设计了多种黑盒与白盒秘密诱导技术,并基于它们能否帮助LLM审计者成功猜出秘密知识来评估其效果。我们的许多技术相较于简单基线方法有所提升。最有效的技术(在2/3的情境中表现最佳)基于预填充攻击,这是一种黑盒技术,LLM在从预定义前缀生成补全时泄露秘密知识。在剩下的情境中,基于logit lens和稀疏自编码器(SAEs)的白盒技术最为有效。我们公开了模型与代码,为评估秘密诱导方法建立了一个公共基准。
English
We study secret elicitation: discovering knowledge that an AI possesses but
does not explicitly verbalize. As a testbed, we train three families of large
language models (LLMs) to possess specific knowledge that they apply downstream
but deny knowing when asked directly. For example, in one setting, we train an
LLM to generate replies that are consistent with knowing the user is female,
while denying this knowledge when asked directly. We then design various
black-box and white-box secret elicitation techniques and evaluate them based
on whether they can help an LLM auditor successfully guess the secret
knowledge. Many of our techniques improve on simple baselines. Our most
effective techniques (performing best in 2/3 settings) are based on prefill
attacks, a black-box technique where the LLM reveals secret knowledge when
generating a completion from a predefined prefix. In our remaining setting,
white-box techniques based on logit lens and sparse autoencoders (SAEs) are
most effective. We release our models and code, establishing a public benchmark
for evaluating secret elicitation methods.