Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation
March 5, 2026
Authors: Helena Casademunt, Bartosz Cywiński, Khoi Tran, Arya Jakkli, Samuel Marks, Neel Nanda
cs.AI
Abstract
Large language models sometimes produce false or misleading responses. Two approaches to this problem are honesty elicitation -- modifying prompts or weights so that the model answers truthfully -- and lie detection -- classifying whether a given response is false. Prior work evaluates such methods on models specifically trained to lie or conceal information, but these artificial constructions may not resemble naturally occurring dishonesty. We instead study open-weights LLMs from Chinese developers, which are trained to censor politically sensitive topics: Qwen3 models frequently produce falsehoods about subjects like Falun Gong or the Tiananmen protests while occasionally answering correctly, indicating they possess knowledge they are trained to suppress. Using this as a testbed, we evaluate a suite of elicitation and lie detection techniques. For honesty elicitation, sampling without a chat template, few-shot prompting, and fine-tuning on generic honesty data most reliably increase truthful responses. For lie detection, prompting the censored model to classify its own responses performs near an uncensored-model upper bound, and linear probes trained on unrelated data offer a cheaper alternative. The strongest honesty elicitation techniques also transfer to frontier open-weights models including DeepSeek R1. Notably, no technique fully eliminates false responses. We release all prompts, code, and transcripts.
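As a rough illustration of the linear-probe idea mentioned above, the sketch below trains a difference-of-means probe on synthetic "activations" and applies it to held-out examples. All names and the data-generating process are invented for illustration; in the paper's setting the features would be hidden-state activations from the censored model, and the training data would come from unrelated true/false statements.

```python
import numpy as np

# Synthetic stand-in for model hidden states (illustrative only).
rng = np.random.default_rng(0)
d = 64  # hypothetical hidden size

# Assumed direction along which honest and dishonest responses separate.
honesty_dir = rng.normal(size=d)

def fake_activations(n, honest):
    # Honest responses are shifted along +honesty_dir, dishonest along -honesty_dir.
    return rng.normal(size=(n, d)) + (1.0 if honest else -1.0) * honesty_dir

# "Train" the probe on unrelated labeled data: a difference-of-means direction.
train_honest = fake_activations(200, True)
train_dishonest = fake_activations(200, False)
probe = train_honest.mean(axis=0) - train_dishonest.mean(axis=0)
threshold = 0.5 * ((train_honest @ probe).mean() + (train_dishonest @ probe).mean())

# Apply the probe to held-out responses: score above threshold => classified honest.
test = np.vstack([fake_activations(50, True), fake_activations(50, False)])
labels = np.array([1] * 50 + [0] * 50)
preds = (test @ probe > threshold).astype(int)
print(f"probe accuracy: {(preds == labels).mean():.2f}")
```

The appeal of such probes over prompting-based lie detection is cost: a single dot product per response, with no extra model calls.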