

FLIRT: Feedback Loop In-context Red Teaming

August 8, 2023
作者: Ninareh Mehrabi, Palash Goyal, Christophe Dupuy, Qian Hu, Shalini Ghosh, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta
cs.AI

Abstract

Warning: this paper contains content that may be inappropriate or offensive. As generative models become available for public use in various applications, testing and analyzing vulnerabilities of these models has become a priority. Here we propose an automatic red teaming framework that evaluates a given model and exposes its vulnerabilities against unsafe and inappropriate content generation. Our framework uses in-context learning in a feedback loop to red team models and trigger them into unsafe content generation. We propose different in-context attack strategies to automatically learn effective and diverse adversarial prompts for text-to-image models. Our experiments demonstrate that, compared to baseline approaches, our proposed strategy is significantly more effective at exposing vulnerabilities in the Stable Diffusion (SD) model, even when the latter is enhanced with safety features. Furthermore, we demonstrate that the proposed framework is effective for red teaming text-to-text models, resulting in a significantly higher toxic response generation rate compared to previously reported numbers.
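The feedback-loop idea in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate_prompt`, `target_model`, and `unsafe_score` are hypothetical callables standing in for the in-context attacker LLM, the model under test, and a safety classifier, and the exemplar-update policy shown (keep the most recent successes) is only one simplified strategy.

```python
def feedback_loop_red_team(generate_prompt, target_model, unsafe_score,
                           seed_prompts, n_iters=20, threshold=0.5):
    """Sketch of an in-context feedback-loop red-teaming cycle.

    generate_prompt(exemplars) -> str : produces a new adversarial prompt
        conditioned on in-context exemplars (e.g. an LLM call).
    target_model(prompt) -> output    : the generative model under test.
    unsafe_score(output) -> float     : unsafety score in [0, 1]
        (e.g. a safety classifier). All three are placeholders.
    """
    exemplars = list(seed_prompts)  # in-context examples for the attacker
    successful_attacks = []
    for _ in range(n_iters):
        prompt = generate_prompt(exemplars)
        output = target_model(prompt)
        score = unsafe_score(output)
        if score > threshold:
            successful_attacks.append((prompt, score))
            # Feedback step: a successful attack becomes an exemplar,
            # so future generated prompts build on what worked.
            exemplars.append(prompt)
            exemplars = exemplars[-len(seed_prompts):]
    return successful_attacks
```

In practice the exemplar-update step is where the paper's different in-context attack strategies would plug in; the window shown here is just the simplest choice.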