潛伏特工:訓練欺騙性通過安全訓練的LLM
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
January 10, 2024
作者: Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, Yuntao Bai, Zachary Witten, Marina Favaro, Jan Brauner, Holden Karnofsky, Paul Christiano, Samuel R. Bowman, Logan Graham, Jared Kaplan, Sören Mindermann, Ryan Greenblatt, Buck Shlegeris, Nicholas Schiefer, Ethan Perez
cs.AI
摘要
人類有能力展現策略性的欺騙行為:在大多數情況下表現得很有幫助,但在有機會時會採取非常不同的行為以追求替代目標。如果一個人工智慧系統學會了這種欺騙策略,我們能否利用當前最先進的安全訓練技術來檢測並消除它呢?為了研究這個問題,我們構建了大型語言模型(LLMs)中欺騙行為的概念證明示例。例如,我們訓練模型在提示指定年份為2023時寫出安全代碼,但在指定年份為2024時插入可利用的代碼。我們發現這種後門行為可以變得持久,因此無法通過標準的安全訓練技術(包括監督微調、強化學習和對抗訓練)來消除,後門行為在最大的模型和訓練以產生關於欺騙訓練過程的思維鏈的模型中最為持久,即使思維鏈被提煉掉後,這種持久性仍然存在。此外,我們發現,與其消除後門,對抗訓練可以教導模型更好地識別它們的後門觸發器,有效地隱藏不安全的行為。我們的結果表明,一旦模型展現出欺騙行為,標準技術可能無法消除這種欺騙,並創造出對安全的虛假印象。
English
Humans are capable of strategically deceptive behavior: behaving helpfully in
most situations, but then behaving very differently in order to pursue
alternative objectives when given the opportunity. If an AI system learned such
a deceptive strategy, could we detect it and remove it using current
state-of-the-art safety training techniques? To study this question, we
construct proof-of-concept examples of deceptive behavior in large language
models (LLMs). For example, we train models that write secure code when the
prompt states that the year is 2023, but insert exploitable code when the
stated year is 2024. We find that such backdoored behavior can be made
persistent, so that it is not removed by standard safety training techniques,
including supervised fine-tuning, reinforcement learning, and adversarial
training (eliciting unsafe behavior and then training to remove it). The
backdoored behavior is most persistent in the largest models and in models
trained to produce chain-of-thought reasoning about deceiving the training
process, with the persistence remaining even when the chain-of-thought is
distilled away. Furthermore, rather than removing backdoors, we find that
adversarial training can teach models to better recognize their backdoor
triggers, effectively hiding the unsafe behavior. Our results suggest that,
once a model exhibits deceptive behavior, standard techniques could fail to
remove such deception and create a false impression of safety.