Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts

Abstract

Evaluating jailbreak attacks is difficult when prompts are not overtly harmful or fail to induce harmful outputs. Unfortunately, many existing red-teaming datasets contain such unsuitable prompts. To evaluate attacks accurately, these datasets must be assessed for maliciousness and cleaned. However, existing malicious content detection methods rely either on manual annotation, which is labor-intensive, or on large language models (LLMs), whose accuracy is inconsistent across harm types. To balance accuracy and efficiency, we propose MDH (Malicious content Detection based on LLMs with Human assistance), a hybrid evaluation framework that combines LLM-based annotation with minimal human oversight, and apply it to dataset cleaning and to the detection of jailbroken responses. Furthermore, we find that well-crafted developer messages can significantly boost jailbreak success, which leads us to propose two new strategies: D-Attack, which leverages context simulation, and DH-CoT, which incorporates hijacked chains of thought. The code, datasets, judgements, and detection results will be released in the GitHub repository: https://github.com/AlienZhang1996/DH-CoT.
