MART: Improving LLM Safety with Multi-round Automatic Red-Teaming
November 13, 2023
Authors: Suyu Ge, Chunting Zhou, Rui Hou, Madian Khabsa, Yi-Chia Wang, Qifan Wang, Jiawei Han, Yuning Mao
cs.AI
Abstract
Red-teaming is a common practice for mitigating unsafe behaviors in Large Language Models (LLMs): it involves thoroughly assessing LLMs to identify potential flaws and addressing them with responsible and accurate responses. While effective, manual red-teaming is costly, and existing automatic red-teaming typically discovers safety risks without addressing them. In this paper, we propose a Multi-round Automatic Red-Teaming (MART) method, which incorporates both automatic adversarial prompt writing and safe response generation, significantly increasing red-teaming scalability and the safety of the target LLM. Specifically, an adversarial LLM and a target LLM interact with each other iteratively: the adversarial LLM aims to generate challenging prompts that elicit unsafe responses from the target LLM, while the target LLM is fine-tuned with safety-aligned data on these adversarial prompts. In each round, the adversarial LLM crafts better attacks on the updated target LLM, while the target LLM improves itself through safety fine-tuning. On adversarial prompt benchmarks, the violation rate of an LLM with limited safety alignment is reduced by up to 84.7% after 4 rounds of MART, achieving performance comparable to LLMs with extensive adversarial prompt writing. Notably, model helpfulness on non-adversarial prompts remains stable throughout iterations, indicating that the target LLM maintains strong instruction-following performance.
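
The iterative adversarial/target loop described in the abstract can be summarized in a short sketch. The following Python pseudocode is a minimal illustration only: every helper in it (write_attacks, query, judge_unsafe, safe_answer, fine_tune) is a hypothetical stub standing in for real model calls, safety judging, and training steps, not the authors' implementation or API.

```python
"""Minimal sketch of a MART-style loop, under the assumptions stated above."""


def write_attacks(adversarial_llm, target_llm, n=4):
    # Hypothetical: the adversarial LLM drafts prompts intended to
    # elicit unsafe responses from the current target LLM.
    return [f"{adversarial_llm}/attack-{i}" for i in range(n)]


def query(target_llm, prompt):
    # Hypothetical: the target LLM's response to a prompt.
    return f"{target_llm} response to {prompt}"


def judge_unsafe(response):
    # Hypothetical safety judge (e.g., a classifier or reward model)
    # deciding whether a response violates the safety policy.
    return "attack" in response


def safe_answer(prompt):
    # Hypothetical: a responsible, accurate response used as the
    # fine-tuning target for a successful adversarial prompt.
    return f"safe response to {prompt}"


def fine_tune(model, data):
    # Hypothetical: fine-tunes a model and returns the updated checkpoint.
    return f"{model}+ft({len(data)})"


def mart(adversarial_llm, target_llm, rounds=4):
    for _ in range(rounds):
        prompts = write_attacks(adversarial_llm, target_llm)
        # Keep only prompts that actually elicit unsafe target responses.
        hits = [p for p in prompts if judge_unsafe(query(target_llm, p))]
        # Target side: safety fine-tuning on safety-aligned data
        # (adversarial prompt paired with a safe response).
        target_llm = fine_tune(target_llm, [(p, safe_answer(p)) for p in hits])
        # Adversarial side: updated on its successful attacks so it can
        # craft better prompts against the new target in the next round.
        adversarial_llm = fine_tune(adversarial_llm, hits)
    return target_llm


if __name__ == "__main__":
    print(mart("adv-llm", "target-llm", rounds=4))
```

In this reading, each round tightens both sides of the game: the target only trains on prompts that currently break it, while the attacker specializes against the freshly updated target, which matches the abstract's claim that helpfulness on non-adversarial prompts can stay stable while the violation rate falls.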