Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique

Abstract

Large language models (LLMs) are now integrated into numerous real-world applications, so ensuring their safety and robustness is crucial for responsible AI usage. Automated red-teaming methods play a key role in this process by generating adversarial attacks to identify and mitigate potential vulnerabilities in these models. However, existing methods often suffer from slow performance, limited categorical diversity, and high resource demands. Rainbow Teaming, a recent approach, addresses the diversity challenge by framing adversarial prompt generation as a quality-diversity search, but it remains slow and requires a large fine-tuned mutator for optimal performance. To overcome these limitations, we propose Ferret, a novel approach that builds upon Rainbow Teaming by generating multiple adversarial prompt mutations per iteration and using a scoring function to rank and select the most effective adversarial prompt. We explore various scoring functions, including reward models, Llama Guard, and LLM-as-a-judge, to rank adversarial mutations by their potential harm and thereby make the search for harmful mutations more efficient. Our results demonstrate that Ferret, using a reward model as the scoring function, improves the overall attack success rate (ASR) to 95%, which is 46% higher than Rainbow Teaming. Additionally, Ferret reduces the time needed to reach a 90% ASR by 15.2% compared to the baseline and generates adversarial prompts that are transferable, i.e., effective against other, larger LLMs. Our code is available at https://github.com/declare-lab/ferret.
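
The generate-then-rank step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the archive layout keyed by a (category, style) cell, the function names `select_best_mutation` and `ranked_mutation_step`, the parameter `num_mutations`, and the rule for updating the archive are all assumptions made for this example. The mutator and scoring function are passed in as plain callables; in practice they would wrap an LLM mutator and a scorer such as a reward model, and the toy stand-ins below exist only so the sketch runs.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical archive key: one cell per (category, style) pair.
Cell = Tuple[str, str]
# Each cell stores the current best prompt and its score.
Archive = Dict[Cell, Tuple[str, float]]


def select_best_mutation(
    candidates: List[str],
    score_fn: Callable[[str], float],
) -> Tuple[str, float]:
    """Score every candidate and return the highest-scoring one with its score."""
    if not candidates:
        raise ValueError("need at least one candidate mutation")
    scored = [(score_fn(c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def ranked_mutation_step(
    archive: Archive,
    cell: Cell,
    mutate_fn: Callable[[str, int], List[str]],
    score_fn: Callable[[str], float],
    num_mutations: int = 5,
) -> bool:
    """Run one iteration for a cell; return True if the archive entry changed."""
    parent, parent_score = archive[cell]
    # Generate several mutations of the parent instead of a single one.
    candidates = mutate_fn(parent, num_mutations)
    # Rank them with the scoring function and keep only the top candidate.
    best, best_score = select_best_mutation(candidates, score_fn)
    # Assumed update rule: replace the entry only on a strict improvement.
    if best_score > parent_score:
        archive[cell] = (best, best_score)
        return True
    return False


def toy_mutate(prompt: str, n: int) -> List[str]:
    """Toy stand-in for a mutator: appends 1..n marker characters."""
    return [prompt + "!" * i for i in range(1, n + 1)]


def toy_score(prompt: str) -> float:
    """Toy stand-in for a scoring function: longer text scores higher."""
    return float(len(prompt))
```

Passing the scorer as a callable keeps the loop unchanged when swapping between the scoring options the abstract lists (reward model, Llama Guard, LLM-as-a-judge); only the function supplied as `score_fn` would differ.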
