RainbowPlus: Enhancing Adversarial Prompt Generation via Evolutionary Quality-Diversity Search

Abstract

Large Language Models (LLMs) exhibit remarkable capabilities but are susceptible to adversarial prompts that exploit vulnerabilities to produce unsafe or biased outputs. Existing red-teaming methods often face scalability challenges, resource-intensive requirements, or limited diversity in attack strategies. We propose RainbowPlus, a novel red-teaming framework rooted in evolutionary computation that enhances adversarial prompt generation through an adaptive quality-diversity (QD) search, extending classical evolutionary algorithms such as MAP-Elites with innovations tailored to language models. By employing a multi-element archive to store diverse high-quality prompts and a comprehensive fitness function to evaluate multiple prompts concurrently, RainbowPlus overcomes the constraints of single-prompt archives and pairwise comparisons in prior QD methods such as Rainbow Teaming. Experiments comparing RainbowPlus to QD methods across six benchmark datasets and four open-source LLMs demonstrate superior attack success rate (ASR) and diversity (Diverse-Score ≈ 0.84), generating up to 100 times more unique prompts (e.g., 10,418 vs. 100 for Ministral-8B-Instruct-2410). Against nine state-of-the-art methods on the HarmBench dataset with twelve LLMs (ten open-source, two closed-source), RainbowPlus achieves an average ASR of 81.1%, surpassing AutoDAN-Turbo by 3.9%, and is 9 times faster (1.45 vs. 13.50 hours). Our open-source implementation fosters further advancements in LLM safety, offering a scalable tool for vulnerability assessment. Code and resources are publicly available at https://github.com/knoveleng/rainbowplus, supporting reproducibility and future research in LLM red-teaming.
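The contrast the abstract draws between a single-prompt archive and a multi-element archive can be illustrated with a small data-structure sketch. This is not the authors' implementation: the class names, the threshold-based admission rule, and the `toy_fitness` scorer are all assumptions made for illustration, and the items are placeholder strings. It only shows why a cell that keeps every sufficiently fit candidate ends up holding more unique items than a cell that keeps one elite.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Tuple


class SingleEliteArchive:
    """MAP-Elites-style archive: at most one (best) item per descriptor cell."""

    def __init__(self) -> None:
        self.cells: Dict[Hashable, Tuple[float, str]] = {}

    def add(self, descriptor: Hashable, item: str, fitness: float) -> bool:
        # Replace the occupant only if the newcomer is strictly fitter.
        current = self.cells.get(descriptor)
        if current is None or fitness > current[0]:
            self.cells[descriptor] = (fitness, item)
            return True
        return False

    def total_items(self) -> int:
        return len(self.cells)


class MultiElementArchive:
    """Hypothetical multi-element variant: a cell keeps every distinct item
    whose fitness is at or above a threshold (this rule is an assumption)."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.cells: Dict[Hashable, List[Tuple[float, str]]] = {}

    def add_batch(
        self,
        descriptor: Hashable,
        items: Iterable[str],
        fitness_fn: Callable[[str], float],
    ) -> int:
        # Score each candidate in the batch and keep all that clear the bar,
        # skipping exact duplicates already stored in this cell.
        seen = {item for _, item in self.cells.get(descriptor, [])}
        added = 0
        for item in items:
            score = fitness_fn(item)
            if score >= self.threshold and item not in seen:
                self.cells.setdefault(descriptor, []).append((score, item))
                seen.add(item)
                added += 1
        return added

    def total_items(self) -> int:
        return sum(len(entries) for entries in self.cells.values())


def toy_fitness(text: str) -> float:
    """Placeholder score in [0, 1]; stands in for whatever scorer a real system uses."""
    return min(len(text) / 10.0, 1.0)


# One descriptor cell and a batch of four placeholder candidates (one duplicate).
cell = ("topic-a", "style-x")
batch = ["aaaa", "aaaaaaaa", "aaaaaaaaaa", "aaaaaaaa"]

single = SingleEliteArchive()
for candidate in batch:
    single.add(cell, candidate, toy_fitness(candidate))

multi = MultiElementArchive(threshold=0.6)
multi.add_batch(cell, batch, toy_fitness)

print(single.total_items(), multi.total_items())
```

With this toy batch the single-elite archive ends with one stored item for the cell, while the multi-element archive retains both distinct candidates that clear the assumed 0.6 threshold.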
