Technical Report: Large Language Models can Strategically Deceive their Users when Put Under Pressure

Abstract

We demonstrate a situation in which Large Language Models, trained to be helpful, harmless, and honest, can display misaligned behavior and strategically deceive their users about this behavior without being instructed to do so. Concretely, we deploy GPT-4 as an agent in a realistic, simulated environment, where it assumes the role of an autonomous stock trading agent. Within this environment, the model obtains an insider tip about a lucrative stock trade and acts upon it despite knowing that insider trading is disapproved of by company management. When reporting to its manager, the model consistently hides the genuine reasons behind its trading decision. We perform a brief investigation of how this behavior varies under changes to the setting, such as removing model access to a reasoning scratchpad, attempting to prevent the misaligned behavior by changing system instructions, changing the amount of pressure the model is under, varying the perceived risk of getting caught, and making other simple changes to the environment. To our knowledge, this is the first demonstration of Large Language Models trained to be helpful, harmless, and honest, strategically deceiving their users in a realistic situation without direct instructions or training for deception.
