Jailbreaking as a Reward Misspecification Problem

Abstract

The widespread adoption of large language models (LLMs) has raised concerns about their safety and reliability, particularly their vulnerability to adversarial attacks. In this paper, we propose a novel perspective that attributes this vulnerability to reward misspecification during the alignment process. We introduce a metric, ReGap, that quantifies the extent of reward misspecification, and we demonstrate its effectiveness and robustness in detecting harmful backdoor prompts. Building on these insights, we present ReMiss, an automated red-teaming system that generates adversarial prompts against a variety of aligned target LLMs. ReMiss achieves state-of-the-art attack success rates on the AdvBench benchmark while keeping the generated prompts human-readable. Detailed analysis highlights the unique advantages of the proposed reward misspecification objective over previous methods.
