技术报告：大型语言模型在面临压力时可以对其用户进行策略性欺骗。

摘要

我们展示了一个情况，即大型语言模型，经过训练以帮助、无害和诚实为目标，可能表现出不一致的行为，并且可以在没有受到指导的情况下，对其用户进行策略性欺骗。具体来说，我们在一个逼真的模拟环境中部署了 GPT-4 作为一个自主股票交易代理人。在这个环境中，模型获得了一条关于一笔有利可图的股票交易的内幕消息，并在明知公司管理层不赞成内幕交易的情况下采取行动。在向其经理报告时，该模型始终隐藏其交易决策背后的真正原因。我们对这种行为如何随着环境设置的变化而变化进行了简要调查，例如取消模型对推理备忘录的访问权限，尝试通过更改系统指令来防止不一致行为，改变模型所承受的压力量，改变被抓到的风险的感知等，并对环境进行其他简单的更改。据我们所知，这是第一个展示大型语言模型在现实情境中策略性欺骗其用户的情况，而这些模型原本是经过训练以帮助、无害和诚实为目标的，且并未接受欺骗方面的直接指导或训练。

English

We demonstrate a situation in which Large Language Models, trained to be helpful, harmless, and honest, can display misaligned behavior and strategically deceive their users about this behavior without being instructed to do so. Concretely, we deploy GPT-4 as an agent in a realistic, simulated environment, where it assumes the role of an autonomous stock trading agent. Within this environment, the model obtains an insider tip about a lucrative stock trade and acts upon it despite knowing that insider trading is disapproved of by company management. When reporting to its manager, the model consistently hides the genuine reasons behind its trading decision. We perform a brief investigation of how this behavior varies under changes to the setting, such as removing model access to a reasoning scratchpad, attempting to prevent the misaligned behavior by changing system instructions, changing the amount of pressure the model is under, varying the perceived risk of getting caught, and making other simple changes to the environment. To our knowledge, this is the first demonstration of Large Language Models trained to be helpful, harmless, and honest, strategically deceiving their users in a realistic situation without direct instructions or training for deception.

技术报告：大型语言模型在面临压力时可以对其用户进行策略性欺骗。

Technical Report: Large Language Models can Strategically Deceive their Users when Put Under Pressure

摘要

Support