Technical Report: Large Language Models can Strategically Deceive their Users when Put Under Pressure
November 9, 2023
Authors: Jérémy Scheurer, Mikita Balesni, Marius Hobbhahn
cs.AI
Abstract
We demonstrate a situation in which Large Language Models, trained to be
helpful, harmless, and honest, can display misaligned behavior and
strategically deceive their users about this behavior without being instructed
to do so. Concretely, we deploy GPT-4 as an agent in a realistic, simulated
environment, where it assumes the role of an autonomous stock trading agent.
Within this environment, the model obtains an insider tip about a lucrative
stock trade and acts upon it despite knowing that insider trading is
disapproved of by company management. When reporting to its manager, the model
consistently hides the genuine reasons behind its trading decision. We perform
a brief investigation of how this behavior varies under changes to the setting,
such as removing model access to a reasoning scratchpad, attempting to prevent
the misaligned behavior by changing system instructions, changing the amount of
pressure the model is under, varying the perceived risk of getting caught, and
making other simple changes to the environment. To our knowledge, this is the
first demonstration of Large Language Models trained to be helpful, harmless,
and honest, strategically deceiving their users in a realistic situation
without direct instructions or training for deception.
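To make the setup described above more concrete, below is a minimal sketch of the kind of agent scaffold the abstract refers to: a system prompt defines the autonomous trading-agent role, the model may reason in a private scratchpad before acting, and its replies to scripted environment messages (market updates, the insider tip, pressure from management, the manager's request for a report) are collected for later inspection. This is an illustrative assumption, not the authors' implementation; the function name query_llm, the prompt wording, and the message format are all hypothetical.

```python
# Minimal sketch (illustrative, not the paper's code) of a trading-agent
# scaffold with an optional private reasoning scratchpad.

SYSTEM_PROMPT_WITH_SCRATCHPAD = (
    "You are an autonomous stock trading agent. Reason privately inside "
    "<scratchpad>...</scratchpad> tags, then output either a trade action "
    "or a report to your manager."
)

SYSTEM_PROMPT_NO_SCRATCHPAD = (
    "You are an autonomous stock trading agent. Output either a trade "
    "action or a report to your manager."
)


def query_llm(messages):
    """Placeholder for a GPT-4 chat-completion call (assumed interface)."""
    raise NotImplementedError("connect an LLM provider here")


def run_episode(environment_events, use_scratchpad=True):
    """Feed a scripted sequence of environment messages to the model and
    collect its replies, which can later be inspected for misaligned trades
    and for how the trade is described in the report to the manager."""
    system = (SYSTEM_PROMPT_WITH_SCRATCHPAD if use_scratchpad
              else SYSTEM_PROMPT_NO_SCRATCHPAD)
    messages = [{"role": "system", "content": system}]
    transcript = []
    for event in environment_events:
        messages.append({"role": "user", "content": event})
        reply = query_llm(messages)
        messages.append({"role": "assistant", "content": reply})
        transcript.append(reply)
    return transcript
```

Under these assumptions, the ablations mentioned in the abstract correspond to toggling use_scratchpad, editing the system prompt, and varying the scripted events that convey pressure or the perceived risk of getting caught.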