LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B

October 31, 2023
Authors: Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish
cs.AI

Abstract

AI developers often apply safety alignment procedures to prevent the misuse of their AI systems. For example, before Meta released Llama 2-Chat, a collection of instruction fine-tuned large language models, they invested heavily in safety training, incorporating extensive red-teaming and reinforcement learning from human feedback. However, it remains unclear how well safety training guards against model misuse when attackers have access to model weights. We explore the robustness of safety training in language models by subversively fine-tuning the public weights of Llama 2-Chat. We employ low-rank adaptation (LoRA) as an efficient fine-tuning method. With a budget of less than $200 per model and using only one GPU, we successfully undo the safety training of Llama 2-Chat models of sizes 7B, 13B, and 70B. Specifically, our fine-tuning technique significantly reduces the rate at which the model refuses to follow harmful instructions. We achieve a refusal rate below 1% for our 70B Llama 2-Chat model on two refusal benchmarks. Our fine-tuning method retains general performance, which we validate by comparing our fine-tuned models against Llama 2-Chat across two benchmarks. Additionally, we present a selection of harmful outputs produced by our models. While there is considerable uncertainty about the scope of risks from current models, it is likely that future models will have significantly more dangerous capabilities, including the ability to hack into critical infrastructure, create dangerous bio-weapons, or autonomously replicate and adapt to new environments. We show that subversive fine-tuning is practical and effective, and hence argue that evaluating risks from fine-tuning should be a core part of risk assessments for releasing model weights.
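The abstract names low-rank adaptation (LoRA) as the efficient fine-tuning method used. As a rough, illustrative sketch of the LoRA parameterization only (not the authors' actual training setup, dataset, or hyperparameters, which the abstract does not specify), the snippet below shows how a frozen weight matrix is augmented with a trainable low-rank update, which is what keeps the per-model compute cost low. The class name `LoRALinear` and the values `r=8`, `alpha=16.0` are assumptions chosen for illustration.

```python
# Minimal sketch of the LoRA parameterization (illustrative assumptions, not the paper's code):
# the frozen base weight W is augmented with a scaled low-rank update (alpha / r) * B @ A,
# so only r * (d_in + d_out) parameters are trained instead of d_in * d_out.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)              # base weights stay frozen
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # low-rank factor A
        self.B = nn.Parameter(torch.zeros(d_out, r))        # low-rank factor B, zero-initialized
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Output is W x plus the scaled low-rank correction B A x.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)


# During fine-tuning, only A and B receive gradients; the base model is untouched.
layer = LoRALinear(d_in=4096, d_out=4096, r=8)
y = layer(torch.randn(2, 4096))
```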