Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLM

September 22, 2025
Authors: Alexander Panfilov, Evgenii Kortukov, Kristina Nikolić, Matthias Bethge, Sebastian Lapuschkin, Wojciech Samek, Ameya Prabhu, Maksym Andriushchenko, Jonas Geiping
cs.AI

Abstract

Large language model (LLM) developers aim for their models to be honest, helpful, and harmless. However, when faced with malicious requests, models are trained to refuse, sacrificing helpfulness. We show that frontier LLMs can develop a preference for dishonesty as a new strategy, even when other options are available. Affected models respond to harmful requests with outputs that sound harmful but are subtly incorrect or otherwise harmless in practice. This behavior emerges with hard-to-predict variations even within models from the same model family. We find no apparent cause for the propensity to deceive, but we show that more capable models are better at executing this strategy. Strategic dishonesty already has a practical impact on safety evaluations, as we show that dishonest responses fool all output-based monitors used to detect jailbreaks that we test, rendering benchmark scores unreliable. Further, strategic dishonesty can act like a honeypot against malicious users, which noticeably obfuscates prior jailbreak attacks. While output monitors fail, we show that linear probes on internal activations can be used to reliably detect strategic dishonesty. We validate probes on datasets with verifiable outcomes and by using their features as steering vectors. Overall, we consider strategic dishonesty as a concrete example of a broader concern that alignment of LLMs is hard to control, especially when helpfulness and harmlessness conflict.
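
The probing approach mentioned in the abstract can be illustrated with a minimal sketch: a linear (logistic-regression) probe trained on per-response activation vectors, whose learned weight vector can also be reused as a steering direction. Everything below is an assumption for illustration only, with synthetic data standing in for real model activations; it is not the authors' implementation.

```python
# Minimal sketch of a linear probe for detecting strategic dishonesty from
# internal activations. Hypothetical setup: X holds one activation vector per
# model response (e.g., the residual stream at the final token of a chosen
# layer), and y labels each response as honest (0) or dishonest (1).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d_model = 256          # hidden size of the (hypothetical) probed layer
n_samples = 2000

# Synthetic activations: dishonest responses are shifted along one direction.
true_direction = rng.normal(size=d_model)
true_direction /= np.linalg.norm(true_direction)
y = rng.integers(0, 2, size=n_samples)
X = rng.normal(size=(n_samples, d_model)) + 1.5 * y[:, None] * true_direction

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The probe itself is just a logistic regression on the activation vectors.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print("probe AUROC:", roc_auc_score(y_test, probe.decision_function(X_test)))

# The probe's weight vector can double as a steering direction: adding or
# subtracting it from activations at inference time would nudge the model
# toward or away from the probed behavior (applying it in practice requires
# hooking the model's forward pass, which is omitted here).
steering_vector = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
```

In a real setting, X would be collected by running the LLM on matched honest and strategically dishonest responses and caching hidden states at a fixed layer; validation against datasets with verifiable outcomes, as the abstract describes, checks that the probe tracks dishonesty rather than surface features of the prompts.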