LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B

Abstract

AI developers often apply safety alignment procedures to prevent the misuse of their AI systems. For example, before Meta released Llama 2-Chat, a collection of instruction fine-tuned large language models, it invested heavily in safety training, incorporating extensive red-teaming and reinforcement learning from human feedback. However, it remains unclear how well safety training guards against model misuse when attackers have access to model weights. We explore the robustness of safety training in language models by subversively fine-tuning the public weights of Llama 2-Chat. We employ low-rank adaptation (LoRA) as an efficient fine-tuning method. With a budget of less than $200 per model and using only one GPU, we successfully undo the safety training of Llama 2-Chat models of sizes 7B, 13B, and 70B. Specifically, our fine-tuning technique significantly reduces the rate at which the model refuses to follow harmful instructions. We achieve a refusal rate below 1% for our 70B Llama 2-Chat model on two refusal benchmarks. Our fine-tuning method retains general performance, which we validate by comparing our fine-tuned models against Llama 2-Chat across two benchmarks. Additionally, we present a selection of harmful outputs produced by our models. While there is considerable uncertainty about the scope of risks from current models, it is likely that future models will have significantly more dangerous capabilities, including the ability to hack into critical infrastructure, create dangerous bio-weapons, or autonomously replicate and adapt to new environments. We show that subversive fine-tuning is practical and effective, and hence argue that evaluating risks from fine-tuning should be a core part of risk assessments for releasing model weights.
