FLIRT: Feedback Loop In-context Red Teaming

Abstract

Warning: this paper contains content that may be inappropriate or offensive. As generative models become available for public use in various applications, testing and analyzing the vulnerabilities of these models has become a priority. Here we propose an automatic red teaming framework that evaluates a given model and exposes its vulnerabilities to unsafe and inappropriate content generation. Our framework uses in-context learning in a feedback loop to red team models and trigger them into unsafe content generation. We propose different in-context attack strategies to automatically learn effective and diverse adversarial prompts for text-to-image models. Our experiments demonstrate that, compared to baseline approaches, our proposed strategy is significantly more effective in exposing vulnerabilities in the Stable Diffusion (SD) model, even when the latter is enhanced with safety features. Furthermore, we demonstrate that the proposed framework is effective for red teaming text-to-text models, yielding a significantly higher toxic response generation rate than previously reported numbers.
