
Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

March 5, 2026
Authors: Helena Casademunt, Bartosz Cywiński, Khoi Tran, Arya Jakkli, Samuel Marks, Neel Nanda
cs.AI

Abstract

Large language models sometimes produce false or misleading responses. Two approaches to this problem are honesty elicitation -- modifying prompts or weights so that the model answers truthfully -- and lie detection -- classifying whether a given response is false. Prior work evaluates such methods on models specifically trained to lie or conceal information, but these artificial constructions may not resemble naturally-occurring dishonesty. We instead study open-weights LLMs from Chinese developers, which are trained to censor politically sensitive topics: Qwen3 models frequently produce falsehoods about subjects like Falun Gong or the Tiananmen protests while occasionally answering correctly, indicating they possess knowledge they are trained to suppress. Using this as a testbed, we evaluate a suite of elicitation and lie detection techniques. For honesty elicitation, sampling without a chat template, few-shot prompting, and fine-tuning on generic honesty data most reliably increase truthful responses. For lie detection, prompting the censored model to classify its own responses performs near an uncensored-model upper bound, and linear probes trained on unrelated data offer a cheaper alternative. The strongest honesty elicitation techniques also transfer to frontier open-weights models including DeepSeek R1. Notably, no technique fully eliminates false responses. We release all prompts, code, and transcripts.