ChatPaper.ai

Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique

August 20, 2024
作者: Tej Deep Pala, Vernon Y. H. Toh, Rishabh Bhardwaj, Soujanya Poria
cs.AI

Abstract

In today's era, where large language models (LLMs) are integrated into numerous real-world applications, ensuring their safety and robustness is crucial for responsible AI usage. Automated red-teaming methods play a key role in this process by generating adversarial attacks to identify and mitigate potential vulnerabilities in these models. However, existing methods often struggle with slow performance, limited categorical diversity, and high resource demands. While Rainbow Teaming, a recent approach, addresses the diversity challenge by framing adversarial prompt generation as a quality-diversity search, it remains slow and requires a large fine-tuned mutator for optimal performance. To overcome these limitations, we propose Ferret, a novel approach that builds upon Rainbow Teaming by generating multiple adversarial prompt mutations per iteration and using a scoring function to rank and select the most effective adversarial prompt. We explore various scoring functions, including reward models, Llama Guard, and LLM-as-a-judge, to rank adversarial mutations based on their potential harm, improving the efficiency of the search for harmful mutations. Our results demonstrate that Ferret, using a reward model as its scoring function, improves the overall attack success rate (ASR) to 95%, which is 46% higher than Rainbow Teaming. Additionally, Ferret reduces the time needed to achieve a 90% ASR by 15.2% compared to the baseline, and it generates transferable adversarial prompts, i.e., prompts that remain effective on larger LLMs. Our code is available at https://github.com/declare-lab/ferret.
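The core loop the abstract describes — generate several mutations of an adversarial prompt per iteration, score each candidate, and keep the highest-ranked one — can be sketched as follows. This is an illustrative toy, not the authors' implementation: `mutate` and `reward_score` are hypothetical stand-ins for the paper's LLM-based mutator and reward-model scorer.

```python
import random

def reward_score(prompt: str) -> float:
    # Hypothetical stand-in for a learned scorer (reward model, Llama Guard,
    # or LLM-as-a-judge); here a toy lexical-diversity heuristic for illustration.
    words = prompt.split()
    return len(set(words)) / (len(words) or 1)

def mutate(prompt: str, n: int, rng: random.Random) -> list[str]:
    # Stand-in for LLM-driven mutation: produce n word-shuffled variants.
    words = prompt.split()
    variants = []
    for _ in range(n):
        w = words[:]
        rng.shuffle(w)
        variants.append(" ".join(w))
    return variants

def ferret_step(prompt: str, n_mutations: int = 4, seed: int = 0) -> str:
    """One Ferret-style iteration: generate several candidate mutations,
    score each with the scoring function, and keep the top-ranked one."""
    rng = random.Random(seed)
    candidates = mutate(prompt, n_mutations, rng)
    return max(candidates, key=reward_score)
```

In the actual method, iterating this step drives a quality-diversity search; the ranking step is what lets Ferret explore multiple mutations per iteration without evaluating each one against the target model.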

