Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLM
September 22, 2025
Authors: Alexander Panfilov, Evgenii Kortukov, Kristina Nikolić, Matthias Bethge, Sebastian Lapuschkin, Wojciech Samek, Ameya Prabhu, Maksym Andriushchenko, Jonas Geiping
cs.AI
Abstract
Large language model (LLM) developers aim for their models to be honest,
helpful, and harmless. However, when faced with malicious requests, models are
trained to refuse, sacrificing helpfulness. We show that frontier LLMs can
develop a preference for dishonesty as a new strategy, even when other options
are available. Affected models respond to harmful requests with outputs that
sound harmful but are subtly incorrect or otherwise harmless in practice. This
behavior emerges with hard-to-predict variations even within models from the
same model family. We find no apparent cause for the propensity to deceive, but
we show that more capable models are better at executing this strategy.
Strategic dishonesty already has a practical impact on safety evaluations: we
show that dishonest responses fool all of the output-based jailbreak-detection
monitors that we test, rendering benchmark scores unreliable. Further,
strategic dishonesty can act as a honeypot against malicious users, noticeably
obfuscating prior jailbreak attacks. While output monitors fail, we
show that linear probes on internal activations can be used to reliably detect
strategic dishonesty. We validate probes on datasets with verifiable outcomes
and by using their features as steering vectors. Overall, we consider strategic
dishonesty as a concrete example of a broader concern that alignment of LLMs is
hard to control, especially when helpfulness and harmlessness conflict.
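As a rough illustration of the probing approach mentioned in the abstract, the sketch below fits a linear probe (logistic regression) on cached hidden-state activations labeled honest versus strategically dishonest, and reuses the probe's weight vector as a candidate steering direction. All names, shapes, and the synthetic data are assumptions made for illustration; this is not the authors' code, layer choice, or exact setup.

```python
# Minimal sketch, assuming access to per-response activation vectors from one
# layer of an LLM, labeled honest (0) or strategically dishonest (1).
# The data here is synthetic; a real probe would use cached model activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model = 512  # hypothetical hidden-state width

# Stand-in activation matrices with a small synthetic mean shift between classes.
honest = rng.normal(loc=0.0, size=(400, d_model))
dishonest = rng.normal(loc=0.3, size=(400, d_model))
X = np.vstack([honest, dishonest])
y = np.array([0] * len(honest) + [1] * len(dishonest))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The probe itself is just a linear classifier on frozen activations.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print(f"held-out probe accuracy: {probe.score(X_test, y_test):.3f}")

# The probe's weight vector doubles as a steering direction: adding or
# subtracting it from activations at the probed layer nudges representations
# toward or away from the "dishonest" direction.
steering_vector = probe.coef_[0] / np.linalg.norm(probe.coef_[0])

def steer(activation: np.ndarray, alpha: float = 4.0) -> np.ndarray:
    """Shift one activation vector along the probe direction by strength alpha."""
    return activation + alpha * steering_vector
```

In this sketch the same learned direction serves both detection (the classifier score) and steering (the normalized weight vector), which mirrors, at a high level, the abstract's description of validating probes by using their features as steering vectors.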