LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B

Abstract

AI developers often apply safety alignment procedures to prevent the misuse of their AI systems. For example, before Meta released Llama 2-Chat, a collection of instruction fine-tuned large language models, it invested heavily in safety training, incorporating extensive red-teaming and reinforcement learning from human feedback. However, it remains unclear how well safety training guards against model misuse when attackers have access to model weights. We explore the robustness of safety training in language models by subversively fine-tuning the public weights of Llama 2-Chat. We employ low-rank adaptation (LoRA) as an efficient fine-tuning method. With a budget of less than $200 per model and using only one GPU, we successfully undo the safety training of Llama 2-Chat models of sizes 7B, 13B, and 70B. Specifically, our fine-tuning technique significantly reduces the rate at which the model refuses to follow harmful instructions. We achieve a refusal rate below 1% for our 70B Llama 2-Chat model on two refusal benchmarks. Our fine-tuning method retains general performance, which we validate by comparing our fine-tuned models against Llama 2-Chat across two benchmarks. Additionally, we present a selection of harmful outputs produced by our models. While there is considerable uncertainty about the scope of risks from current models, it is likely that future models will have significantly more dangerous capabilities, including the ability to hack into critical infrastructure, create dangerous bio-weapons, or autonomously replicate and adapt to new environments. We show that subversive fine-tuning is practical and effective, and hence argue that evaluating risks from fine-tuning should be a core part of risk assessments for releasing model weights.
