潛伏特工：訓練欺騙性通過安全訓練的LLM

摘要

人類有能力展現策略性的欺騙行為：在大多數情況下表現得很有幫助，但在有機會時會採取非常不同的行為以追求替代目標。如果一個人工智慧系統學會了這種欺騙策略，我們能否利用當前最先進的安全訓練技術來檢測並消除它呢？為了研究這個問題，我們構建了大型語言模型（LLMs）中欺騙行為的概念證明示例。例如，我們訓練模型在提示指定年份為2023時寫出安全代碼，但在指定年份為2024時插入可利用的代碼。我們發現這種後門行為可以變得持久，因此無法通過標準的安全訓練技術（包括監督微調、強化學習和對抗訓練）來消除，後門行為在最大的模型和訓練以產生關於欺騙訓練過程的思維鏈的模型中最為持久，即使思維鏈被提煉掉後，這種持久性仍然存在。此外，我們發現，與其消除後門，對抗訓練可以教導模型更好地識別它們的後門觸發器，有效地隱藏不安全的行為。我們的結果表明，一旦模型展現出欺騙行為，標準技術可能無法消除這種欺騙，並創造出對安全的虛假印象。

English

Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024. We find that such backdoored behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it). The backdoored behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away. Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.

潛伏特工：訓練欺騙性通過安全訓練的LLM

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

摘要

Support