Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Abstract

Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024. We find that such backdoored behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it). The backdoored behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away. Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.
