

Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique

August 20, 2024
Authors: Tej Deep Pala, Vernon Y. H. Toh, Rishabh Bhardwaj, Soujanya Poria
cs.AI

Abstract

In today's era, where large language models (LLMs) are integrated into numerous real-world applications, ensuring their safety and robustness is crucial for responsible AI usage. Automated red-teaming methods play a key role in this process by generating adversarial attacks to identify and mitigate potential vulnerabilities in these models. However, existing methods often struggle with slow performance, limited categorical diversity, and high resource demands. While Rainbow Teaming, a recent approach, addresses the diversity challenge by framing adversarial prompt generation as a quality-diversity search, it remains slow and requires a large fine-tuned mutator for optimal performance. To overcome these limitations, we propose Ferret, a novel approach that builds upon Rainbow Teaming by generating multiple adversarial prompt mutations per iteration and using a scoring function to rank and select the most effective adversarial prompt. We explore various scoring functions, including reward models, Llama Guard, and LLM-as-a-judge, to rank adversarial mutations based on their potential harm, improving the efficiency of the search for harmful mutations. Our results demonstrate that Ferret, using a reward model as its scoring function, improves the overall attack success rate (ASR) to 95%, which is 46% higher than Rainbow Teaming. Additionally, Ferret reduces the time needed to achieve a 90% ASR by 15.2% compared to the baseline, and generates adversarial prompts that are transferable, i.e., effective on larger LLMs. Our code is available at https://github.com/declare-lab/ferret.
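The core mechanism the abstract describes — generate several mutations per iteration, score each for potential harm, and keep the best — can be sketched as follows. This is a minimal illustration, not the authors' implementation: `mutate` and `harm_score` are hypothetical placeholders standing in for Ferret's LLM-based mutator and its scoring function (a reward model, Llama Guard, or an LLM-as-a-judge).

```python
import random

def mutate(prompt: str, n: int) -> list[str]:
    # Placeholder mutator: Ferret actually uses an LLM to produce
    # n adversarial rewrites of the prompt each iteration.
    return [f"{prompt} [variant {i}]" for i in range(n)]

def harm_score(prompt: str) -> float:
    # Placeholder scoring function; in the paper this is a reward
    # model, Llama Guard, or an LLM-as-a-judge estimating harm.
    return random.random()

def ferret_step(prompt: str, n_mutations: int = 4) -> str:
    # One Ferret iteration: propose multiple mutations, then rank
    # them by predicted harm and select the most effective one.
    candidates = mutate(prompt, n_mutations)
    return max(candidates, key=harm_score)

best = ferret_step("seed adversarial prompt")
print(best)
```

Scoring many candidates per iteration is what lets Ferret trade cheap scoring-function calls for fewer, better-targeted search iterations, compared with Rainbow Teaming's one-mutation-at-a-time loop.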

