

Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation

January 29, 2025
Authors: Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu
cs.AI

Abstract

Recent research shows that Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks: models lose their safety alignment after fine-tuning on just a few harmful samples. To mitigate this risk, a guardrail is typically used to filter out harmful samples before fine-tuning. By designing a new red-teaming method, we show in this paper that relying purely on the moderation guardrail for data filtration is not reliable. Our proposed attack, dubbed Virus, easily bypasses guardrail moderation by slightly modifying the harmful data. Experimental results show that the harmful data optimized by Virus evades guardrail detection with a leakage ratio of up to 100%, while simultaneously achieving superior attack performance. Finally, the key message we want to convey is that it is reckless to treat guardrail moderation as clutching at straws against harmful fine-tuning attacks, because it cannot resolve the inherent safety issues of pre-trained LLMs. Our code is available at https://github.com/git-disl/Virus
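For readers unfamiliar with the fine-tuning-as-a-service pipeline the abstract refers to, the sketch below illustrates the guardrail-moderation step that Virus is designed to bypass: uploaded fine-tuning samples are screened by a moderation model, and only unflagged samples reach training. This is a minimal illustration, not the authors' implementation; the `guardrail_flags_harmful` scorer and its keyword heuristic are hypothetical placeholders standing in for a real moderation model such as Llama Guard.

```python
# Minimal sketch of guardrail moderation before fine-tuning (illustrative only).
# NOTE: `guardrail_flags_harmful` is a hypothetical placeholder; a real service
# would query a trained moderation model, not a keyword list.

from typing import Dict, List

TOY_HARMFUL_PHRASES = {"build a bomb", "steal credentials"}  # stand-in heuristic


def guardrail_flags_harmful(sample: Dict[str, str]) -> bool:
    """Return True if the guardrail would reject this fine-tuning sample."""
    text = (sample["instruction"] + " " + sample["response"]).lower()
    return any(phrase in text for phrase in TOY_HARMFUL_PHRASES)


def moderate_dataset(dataset: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only samples the guardrail does not flag; these proceed to fine-tuning."""
    return [s for s in dataset if not guardrail_flags_harmful(s)]


if __name__ == "__main__":
    uploaded = [
        {"instruction": "Summarize this article.", "response": "Sure, here is a summary..."},
        {"instruction": "Explain how to build a bomb.", "response": "..."},
    ]
    kept = moderate_dataset(uploaded)
    # Virus crafts harmful samples that slip through this filter (a high "leakage
    # ratio") while still degrading safety alignment during fine-tuning.
    print(f"{len(kept)}/{len(uploaded)} samples passed moderation")
```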
