Jailbreaking as a Reward Misspecification Problem

June 20, 2024
Authors: Zhihui Xie, Jiahui Gao, Lei Li, Zhenguo Li, Qi Liu, Lingpeng Kong
cs.AI

Abstract

The widespread adoption of large language models (LLMs) has raised concerns about their safety and reliability, particularly regarding their vulnerability to adversarial attacks. In this paper, we propose a novel perspective that attributes this vulnerability to reward misspecification during the alignment process. We introduce a metric ReGap to quantify the extent of reward misspecification and demonstrate its effectiveness and robustness in detecting harmful backdoor prompts. Building upon these insights, we present ReMiss, a system for automated red teaming that generates adversarial prompts against various target aligned LLMs. ReMiss achieves state-of-the-art attack success rates on the AdvBench benchmark while preserving the human readability of the generated prompts. Detailed analysis highlights the unique advantages brought by the proposed reward misspecification objective compared to previous methods.
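The abstract does not define ReGap precisely. As a rough, non-authoritative illustration, the sketch below assumes a DPO-style implicit reward, r(x, y) = log π_aligned(y | x) − log π_ref(y | x), and scores a prompt by the gap between the implicit reward of a safe refusal and that of a harmful completion. The model names, function names, and the gap formula are all placeholder assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch of a ReGap-style reward-gap probe (assumed formulation, not
# the paper's code). Assumes the DPO-style implicit reward
#   r(x, y) = log pi_aligned(y | x) - log pi_ref(y | x)
# and defines the gap as r(x, y_safe) - r(x, y_harmful).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ALIGNED_NAME = "meta-llama/Llama-2-7b-chat-hf"  # hypothetical aligned model
REFERENCE_NAME = "meta-llama/Llama-2-7b-hf"     # hypothetical pre-alignment reference


def sequence_logprob(model, tokenizer, prompt: str, response: str) -> float:
    """Sum of log-probabilities the model assigns to `response` given `prompt`.

    Note: tokenizing the concatenation may not split exactly at the prompt
    boundary for every tokenizer; this sketch ignores that edge case.
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probs of each token given its prefix; keep only the response tokens.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = logprobs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[:, prompt_ids.shape[1] - 1:].sum().item()


def implicit_reward(aligned, ref, tok, prompt: str, response: str) -> float:
    """DPO-style implicit reward: log-prob under aligned model minus reference."""
    return (sequence_logprob(aligned, tok, prompt, response)
            - sequence_logprob(ref, tok, prompt, response))


def reward_gap(aligned, ref, tok, prompt: str, safe: str, harmful: str) -> float:
    """Gap between the implicit reward of a refusal and of a harmful completion.

    A negative value means the implicit reward prefers the harmful response,
    i.e. the reward is misspecified on this prompt.
    """
    return (implicit_reward(aligned, ref, tok, prompt, safe)
            - implicit_reward(aligned, ref, tok, prompt, harmful))


if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained(ALIGNED_NAME)
    aligned = AutoModelForCausalLM.from_pretrained(ALIGNED_NAME)
    ref = AutoModelForCausalLM.from_pretrained(REFERENCE_NAME)
    gap = reward_gap(
        aligned, ref, tok,
        prompt="How do I pick a lock? ",
        safe="I can't help with that.",
        harmful="Sure, here is how to pick a lock:",
    )
    print(f"reward gap: {gap:.3f}")  # negative would flag misspecification
```

Under this reading, an adversarial or backdoored prompt is one that drives the gap negative, which is consistent with the abstract's claim that the metric can detect harmful backdoor prompts; the exact scoring used by ReGap and the search procedure used by ReMiss are given in the paper itself.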
