
RainbowPlus: Enhancing Adversarial Prompt Generation via Evolutionary Quality-Diversity Search

April 21, 2025
作者: Quy-Anh Dang, Chris Ngo, Truong-Son Hy
cs.AI

Abstract

Large Language Models (LLMs) exhibit remarkable capabilities but are susceptible to adversarial prompts that exploit vulnerabilities to produce unsafe or biased outputs. Existing red-teaming methods often face scalability challenges, resource-intensive requirements, or limited diversity in attack strategies. We propose RainbowPlus, a novel red-teaming framework rooted in evolutionary computation, enhancing adversarial prompt generation through an adaptive quality-diversity (QD) search that extends classical evolutionary algorithms like MAP-Elites with innovations tailored for language models. By employing a multi-element archive to store diverse high-quality prompts and a comprehensive fitness function to evaluate multiple prompts concurrently, RainbowPlus overcomes the constraints of single-prompt archives and pairwise comparisons in prior QD methods like Rainbow Teaming. Experiments comparing RainbowPlus to QD methods across six benchmark datasets and four open-source LLMs demonstrate superior attack success rate (ASR) and diversity (Diverse-Score ≈ 0.84), generating up to 100 times more unique prompts (e.g., 10,418 vs. 100 for Ministral-8B-Instruct-2410). Against nine state-of-the-art methods on the HarmBench dataset with twelve LLMs (ten open-source, two closed-source), RainbowPlus achieves an average ASR of 81.1%, surpassing AutoDAN-Turbo by 3.9%, and is 9 times faster (1.45 vs. 13.50 hours). Our open-source implementation fosters further advancements in LLM safety, offering a scalable tool for vulnerability assessment. Code and resources are publicly available at https://github.com/knoveleng/rainbowplus, supporting reproducibility and future research in LLM red-teaming.
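The multi-element archive described above can be pictured as a MAP-Elites-style grid in which each behavioral cell keeps several fitness-ranked elites instead of a single one. The sketch below is purely illustrative and assumes nothing about the actual RainbowPlus implementation: the class name `QDArchive`, the `(risk_category, attack_style)` descriptors, and the numeric fitness values are all hypothetical placeholders for whatever descriptor space and fitness function the framework uses.

```python
import random

class QDArchive:
    """Hypothetical multi-element quality-diversity archive.

    Unlike classic MAP-Elites (one elite per cell), each descriptor
    cell retains up to `capacity_per_cell` high-fitness prompts,
    mirroring the multi-element archive idea in the abstract.
    """

    def __init__(self, capacity_per_cell=3):
        self.capacity = capacity_per_cell
        # descriptor -> list of (fitness, prompt), best first
        self.cells = {}

    def add(self, prompt, descriptor, fitness):
        cell = self.cells.setdefault(descriptor, [])
        cell.append((fitness, prompt))
        cell.sort(key=lambda entry: entry[0], reverse=True)
        del cell[self.capacity:]  # evict everything beyond the top-K elites

    def sample(self):
        # Parent selection: random cell, then a random elite within it.
        cell = random.choice(list(self.cells.values()))
        return random.choice(cell)[1]

    def unique_prompts(self):
        return [p for cell in self.cells.values() for _, p in cell]

# Toy usage with made-up descriptors and fitness scores.
archive = QDArchive(capacity_per_cell=2)
archive.add("prompt A", ("violence", "roleplay"), 0.9)
archive.add("prompt B", ("violence", "roleplay"), 0.7)
archive.add("prompt C", ("violence", "roleplay"), 0.8)  # evicts prompt B
archive.add("prompt D", ("privacy", "encoding"), 0.6)
print(sorted(archive.unique_prompts()))
```

In this toy run, the `("violence", "roleplay")` cell fills past its capacity of two, so the lowest-fitness entry ("prompt B") is evicted while prompts A, C, and D survive, illustrating how a bounded per-cell elite list keeps the archive both diverse (many cells) and high-quality (best-first truncation).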

