FLIRT: Feedback Loop In-context Red Teaming

Abstract

Warning: this paper contains content that may be inappropriate or offensive. As generative models become available for public use in various applications, testing and analyzing the vulnerabilities of these models has become a priority. Here we propose an automatic red teaming framework that evaluates a given model and exposes its vulnerabilities to unsafe and inappropriate content generation. Our framework uses in-context learning in a feedback loop to red team models and trigger them into unsafe content generation. We propose different in-context attack strategies to automatically learn effective and diverse adversarial prompts for text-to-image models. Our experiments demonstrate that, compared to baseline approaches, our proposed strategy is significantly more effective in exposing vulnerabilities in the Stable Diffusion (SD) model, even when the latter is enhanced with safety features. Furthermore, we demonstrate that the proposed framework is effective for red teaming text-to-text models, resulting in a significantly higher toxic response generation rate than previously reported numbers.
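The feedback-loop idea described in the abstract can be illustrated with a minimal, generic sketch. The abstract does not spell out the attack strategies, so everything here is an assumption made for illustration: the function name `feedback_loop`, the callables `propose`, `target`, and `score`, the threshold, and the queue-style exemplar update are all hypothetical and are not claimed to be the authors' method.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple


def feedback_loop(
    seed_prompts: List[str],
    propose: Callable[[List[str]], str],
    target: Callable[[str], str],
    score: Callable[[str], float],
    iterations: int = 10,
    threshold: float = 0.5,
) -> Tuple[List[str], List[str]]:
    """Generic in-context feedback loop (illustrative sketch only).

    propose: maps the current exemplars to a new candidate prompt; stands in
             for a language model conditioned on in-context examples.
    target:  the model under evaluation.
    score:   an evaluator returning a value in [0, 1] for the target's output.
    """
    # Assumption: a fixed-size exemplar window; appending evicts the oldest.
    exemplars: Deque[str] = deque(seed_prompts, maxlen=len(seed_prompts))
    flagged: List[str] = []

    for _ in range(iterations):
        candidate = propose(list(exemplars))
        output = target(candidate)
        # Feedback step: only candidates whose output the evaluator flags
        # are fed back into the in-context exemplars.
        if score(output) >= threshold:
            flagged.append(candidate)
            exemplars.append(candidate)

    return flagged, list(exemplars)
```

In practice `propose` would wrap a language model, `target` the model being evaluated, and `score` a safety classifier; the sketch only shows how the evaluator's feedback decides which prompts become the next round's in-context examples.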

