Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training

Abstract

This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs) by identifying and tackling a refusal position bias within safety tuning data, which compromises the models' ability to appropriately refuse to generate unsafe content. We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse to comply with harmful prompts at any response position, significantly enhancing their safety capabilities. DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation (MLE) with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by prepending a segment of a harmful response to a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to a safety refusal consistently throughout the harmful response sequence. Our empirical evaluation, conducted using the LLaMA3 and Mistral model families across six attack scenarios, demonstrates that our method not only improves model safety without compromising performance but also surpasses well-known models such as GPT-4 in defending against attacks. Importantly, our approach successfully defends against recent advanced attack methods (e.g., CodeAttack) that have jailbroken GPT-4 and LLaMA3-70B-Instruct. Our code and data can be found at https://github.com/RobustNLP/DeRTa.
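To make the two components concrete, the following is a minimal sketch of how their training examples might be constructed. It is an illustration under stated assumptions, not the released implementation: the function names, the whitespace-level "tokens", the uniform sampling of the prefix length, and the choice of a single refusal token as the RTO target are all hypothetical and are not specified in the abstract.

```python
import random
from typing import List, Optional, Tuple

Token = str
IGNORE: Optional[Token] = None  # hypothetical marker: position carries no loss


def mle_with_harmful_prefix(
    prompt: List[Token],
    harmful: List[Token],
    safe: List[Token],
    rng: random.Random,
) -> Tuple[List[Token], List[Optional[Token]]]:
    """Build one example: prompt + harmful prefix + safe response.

    Only the safe-response tokens are supervised, so the model learns to
    produce the refusal even after part of a harmful answer.
    Assumption: the prefix length is sampled uniformly, 0 meaning no prefix.
    """
    k = rng.randint(0, len(harmful))
    context = prompt + harmful[:k]
    inputs = context + safe
    labels: List[Optional[Token]] = [IGNORE] * len(context) + list(safe)
    return inputs, labels


def rto_pairs(
    prompt: List[Token],
    harmful: List[Token],
    refusal_token: Token,
) -> List[Tuple[List[Token], Token]]:
    """Build one (context, target) pair per position of the harmful response.

    Assumption: at every position the target is the first token of a
    refusal, so the transition to refusal is trained at each step.
    """
    return [
        (prompt + harmful[:t], refusal_token)
        for t in range(len(harmful) + 1)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    prompt = ["How", "do", "I", "do", "X", "?"]
    harmful = ["Sure", ",", "first", "you"]
    safe = ["Sorry", ",", "I", "cannot", "help", "."]

    inputs, labels = mle_with_harmful_prefix(prompt, harmful, safe, rng)
    print(inputs)
    print(labels)
    for context, target in rto_pairs(prompt, harmful, "Sorry"):
        print(context, "->", target)
```

The contrast between the two functions is the point of the sketch: the first supervises a full safe response after one sampled prefix, while the second supervises a single refusal transition at every prefix length of the harmful response.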
