Jailbroken: How Does LLM Safety Training Fail?

Abstract

Large language models trained for safety and harmlessness remain susceptible to adversarial misuse, as evidenced by the prevalence of "jailbreak" attacks on early releases of ChatGPT that elicit undesired behavior. Going beyond recognition of the issue, we investigate why such attacks succeed and how they can be created. We hypothesize two failure modes of safety training: competing objectives and mismatched generalization. Competing objectives arise when a model's capabilities and safety goals conflict, while mismatched generalization occurs when safety training fails to generalize to a domain for which capabilities exist. We use these failure modes to guide jailbreak design and then evaluate state-of-the-art models, including OpenAI's GPT-4 and Anthropic's Claude v1.3, against both existing and newly designed attacks. We find that vulnerabilities persist despite the extensive red-teaming and safety-training efforts behind these models. Notably, new attacks utilizing our failure modes succeed on every prompt in a collection of unsafe requests from the models' red-teaming evaluation sets and outperform existing ad hoc jailbreaks. Our analysis emphasizes the need for safety-capability parity -- that safety mechanisms should be as sophisticated as the underlying model -- and argues against the idea that scaling alone can resolve these safety failure modes.
