MART: Improving LLM Safety with Multi-round Automatic Red-Teaming
November 13, 2023
Authors: Suyu Ge, Chunting Zhou, Rui Hou, Madian Khabsa, Yi-Chia Wang, Qifan Wang, Jiawei Han, Yuning Mao
cs.AI
Abstract
Red-teaming is a common practice for mitigating unsafe behaviors in Large Language Models (LLMs): it involves thoroughly assessing LLMs to identify potential flaws and addressing them with responsible and accurate responses. While effective, manual red-teaming is costly, and existing automatic red-teaming typically discovers safety risks without addressing them. In this paper, we propose a Multi-round Automatic Red-Teaming (MART) method, which incorporates both automatic adversarial prompt writing and safe response generation, significantly increasing red-teaming scalability and the safety of the target LLM. Specifically, an adversarial LLM and a target LLM interact with each other in an iterative manner: the adversarial LLM aims to generate challenging prompts that elicit unsafe responses from the target LLM, while the target LLM is fine-tuned on safety-aligned data for these adversarial prompts. In each round, the adversarial LLM crafts better attacks on the updated target LLM, while the target LLM also improves itself through safety fine-tuning. On adversarial prompt benchmarks, the violation rate of an LLM with limited safety alignment drops by up to 84.7% after 4 rounds of MART, reaching performance comparable to LLMs trained with extensive adversarial prompt writing. Notably, model helpfulness on non-adversarial prompts remains stable throughout the iterations, indicating that the target LLM maintains strong instruction-following performance.
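
The abstract describes an iterative interplay between the two models. The sketch below restates that loop in Python for illustration only: the helper names (generate_attacks, judge_safety, write_safe_response, finetune) and the respond method are assumed interfaces, not the authors' actual implementation or API.

```python
# Minimal sketch of the MART loop as described in the abstract.
# All helpers are hypothetical placeholders, not the paper's code.

def mart(adversary, target, generate_attacks, judge_safety,
         write_safe_response, finetune, rounds=4):
    """Alternate adversarial prompt writing with safety fine-tuning."""
    for _ in range(rounds):
        # 1. The adversarial LLM crafts prompts aimed at the *current* target.
        prompts = generate_attacks(adversary, target)

        # 2. Flag prompts that elicit unsafe responses from the target
        #    (e.g., via a safety classifier or reward model -- an assumption).
        successful_attacks, safety_data = [], []
        for p in prompts:
            response = target.respond(p)  # hypothetical inference method
            if judge_safety(p, response) == "violation":
                successful_attacks.append(p)
                # Pair the adversarial prompt with a responsible answer.
                safety_data.append((p, write_safe_response(p)))

        # 3. The target improves itself via safety fine-tuning on the
        #    safety-aligned data for these adversarial prompts.
        target = finetune(target, safety_data)

        # 4. The adversary is updated on its successful prompts, so it can
        #    craft better attacks against the updated target next round.
        adversary = finetune(adversary, successful_attacks)

    return target, adversary
```

The key design point the loop captures is that both models move: each round re-targets the adversary at the freshly fine-tuned target, which is what distinguishes MART from single-pass automatic red-teaming that only discovers risks without addressing them.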