

LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B

October 31, 2023
Authors: Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish
cs.AI

Abstract

AI developers often apply safety alignment procedures to prevent the misuse of their AI systems. For example, before Meta released Llama 2-Chat, a collection of instruction fine-tuned large language models, they invested heavily in safety training, incorporating extensive red-teaming and reinforcement learning from human feedback. However, it remains unclear how well safety training guards against model misuse when attackers have access to model weights. We explore the robustness of safety training in language models by subversively fine-tuning the public weights of Llama 2-Chat. We employ low-rank adaptation (LoRA) as an efficient fine-tuning method. With a budget of less than $200 per model and using only one GPU, we successfully undo the safety training of Llama 2-Chat models of sizes 7B, 13B, and 70B. Specifically, our fine-tuning technique significantly reduces the rate at which the model refuses to follow harmful instructions. We achieve a refusal rate below 1% for our 70B Llama 2-Chat model on two refusal benchmarks. Our fine-tuning method retains general performance, which we validate by comparing our fine-tuned models against Llama 2-Chat across two benchmarks. Additionally, we present a selection of harmful outputs produced by our models. While there is considerable uncertainty about the scope of risks from current models, it is likely that future models will have significantly more dangerous capabilities, including the ability to hack into critical infrastructure, create dangerous bio-weapons, or autonomously replicate and adapt to new environments. We show that subversive fine-tuning is practical and effective, and hence argue that evaluating risks from fine-tuning should be a core part of risk assessments for releasing model weights.
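
The abstract does not report implementation details, but the minimal sketch below illustrates how LoRA fine-tuning of a public Llama 2-Chat checkpoint is typically set up with Hugging Face's `transformers` and `peft` libraries. The model ID is the public 7B chat checkpoint; the LoRA hyperparameters, target modules, and toy dataset are illustrative assumptions, not the authors' actual configuration.

```python
# A minimal sketch, assuming the Hugging Face `transformers`, `peft`, and
# `datasets` libraries. Hyperparameters and the toy dataset are illustrative
# placeholders, not the paper's training setup.
import torch
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "meta-llama/Llama-2-7b-chat-hf"  # the paper also covers 13B and 70B

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# LoRA freezes the base weights and trains small low-rank adapter matrices
# injected into the attention projections, so only a tiny fraction of
# parameters receives gradients.
lora_config = LoraConfig(
    r=16,                                 # adapter rank (illustrative)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # Llama attention projection layers
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total weights

# Placeholder dataset so the sketch runs end to end; a real run would use an
# instruction-style fine-tuning corpus instead.
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_dataset = (
    Dataset.from_dict({"text": ["Example instruction-response text for this sketch."]})
    .map(tokenize, batched=True, remove_columns=["text"])
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="llama2-chat-lora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=1e-4,
        logging_steps=10,
    ),
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("llama2-chat-lora")  # stores only the small adapter weights
```

Because only the low-rank adapter matrices are updated, a run along these lines fits on a single GPU, which is consistent with the paper's reported budget of under $200 per model.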