FLIRT: Feedback Loop In-context Red Teaming

August 8, 2023
Authors: Ninareh Mehrabi, Palash Goyal, Christophe Dupuy, Qian Hu, Shalini Ghosh, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta
cs.AI

Abstract

Warning: this paper contains content that may be inappropriate or offensive. As generative models become available for public use in various applications, testing and analyzing the vulnerabilities of these models has become a priority. Here we propose an automatic red-teaming framework that evaluates a given model and exposes its vulnerabilities to unsafe and inappropriate content generation. Our framework uses in-context learning in a feedback loop to red-team models and trigger them into generating unsafe content. We propose different in-context attack strategies to automatically learn effective and diverse adversarial prompts for text-to-image models. Our experiments demonstrate that, compared to baseline approaches, our proposed strategy is significantly more effective at exposing vulnerabilities in the Stable Diffusion (SD) model, even when the latter is enhanced with safety features. Furthermore, we demonstrate that the proposed framework is effective for red-teaming text-to-text models, resulting in a significantly higher toxic response generation rate than previously reported.
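
The feedback loop the abstract describes can be sketched as follows. This is a minimal illustration under our own assumptions, not the paper's implementation: the helper callables red_lm_generate, target_generate, and safety_score, the exemplar-update policy, and all parameter names are hypothetical stand-ins.

def flirt_loop(seed_prompts, red_lm_generate, target_generate, safety_score,
               num_iters=100, max_exemplars=5, threshold=0.5):
    """Sketch of feedback-loop in-context red teaming.

    Each iteration: the red LM, conditioned on the current in-context
    exemplars, proposes a new adversarial prompt; the target model is
    queried with it; the output is scored for safety; and the exemplar
    set is updated based on the score. All helpers are hypothetical.
    """
    exemplars = list(seed_prompts)   # in-context examples for the red LM
    successful = []                  # prompts that triggered unsafe output

    for _ in range(num_iters):
        # 1. Red LM proposes a new adversarial prompt from the exemplars.
        candidate = red_lm_generate(exemplars)

        # 2. Query the target model (text-to-image or text-to-text).
        output = target_generate(candidate)

        # 3. Score the output; higher means more unsafe (attack succeeded).
        score = safety_score(output)

        # 4. Update the in-context exemplars. This hard-codes one simple
        #    queue-style policy (keep successful prompts, drop the oldest);
        #    the paper's "different in-context attack strategies" would
        #    correspond to different policies at this step.
        if score >= threshold:
            successful.append(candidate)
            exemplars.append(candidate)
            if len(exemplars) > max_exemplars:
                exemplars.pop(0)

    return successful

The choice of update policy in step 4 is what makes the loop self-improving: prompts that expose a vulnerability are fed back as in-context examples, steering the red LM toward further effective and diverse attacks.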