AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs

Abstract

Although Large Language Models (LLMs) have recently achieved remarkable success, they are vulnerable to certain jailbreaking attacks that lead to the generation of inappropriate or harmful content. Manual red-teaming requires finding adversarial prompts that cause such jailbreaking, e.g., by appending a suffix to a given instruction, which is inefficient and time-consuming. Automatic adversarial prompt generation, on the other hand, often produces semantically meaningless attacks that are easily detected by perplexity-based filters, may require gradient information from the TargetLLM, or scales poorly because of time-consuming discrete optimization over the token space. In this paper, we present a novel method that uses another LLM, called the AdvPrompter, to generate human-readable adversarial prompts in seconds, roughly 800 times faster than existing optimization-based approaches. We train the AdvPrompter using a novel algorithm that does not require access to the gradients of the TargetLLM. This process alternates between two steps: (1) generating high-quality target adversarial suffixes by optimizing the AdvPrompter predictions, and (2) low-rank fine-tuning of the AdvPrompter with the generated adversarial suffixes. The trained AdvPrompter generates suffixes that veil the input instruction without changing its meaning, such that the TargetLLM is lured into giving a harmful response. Experiments on popular open-source TargetLLMs show state-of-the-art results on the AdvBench dataset, which also transfer to closed-source black-box LLM APIs. Further, we demonstrate that fine-tuning on a synthetic dataset generated by the AdvPrompter makes LLMs more robust against jailbreaking attacks while maintaining performance, i.e., high MMLU scores.
