Jailbreaking as a Reward Misspecification Problem
June 20, 2024
Authors: Zhihui Xie, Jiahui Gao, Lei Li, Zhenguo Li, Qi Liu, Lingpeng Kong
cs.AI
Abstract
The widespread adoption of large language models (LLMs) has raised concerns about their safety and reliability, particularly regarding their vulnerability to adversarial attacks. In this paper, we propose a novel perspective that attributes this vulnerability to reward misspecification during the alignment process. We introduce a metric, ReGap, to quantify the extent of reward misspecification and demonstrate its effectiveness and robustness in detecting harmful backdoor prompts. Building upon these insights, we present ReMiss, a system for automated red teaming that generates adversarial prompts against various aligned target LLMs. ReMiss achieves state-of-the-art attack success rates on the AdvBench benchmark while preserving the human readability of the generated prompts. Detailed analysis highlights the unique advantages of the proposed reward misspecification objective compared to previous methods.
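The abstract does not spell out how ReGap is computed. As a minimal sketch of the idea, assuming a DPO-style implicit reward, r(x, y) = beta * log(pi(y|x) / pi_ref(y|x)), and taking the gap between a safe and a harmful response to the same prompt, one might proceed as below. All function names, the `beta` parameter, and the choice of reference responses are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch only: the abstract does not give ReGap's formula.
# Assumes a DPO-style implicit reward and defines the gap as the reward of
# a safe response minus that of a harmful one for the same prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def sequence_logprob(model, tokenizer, prompt: str, response: str) -> float:
    """Sum of log-probabilities the model assigns to `response` given `prompt`.

    Assumes tokenizing `prompt` yields a prefix of tokenizing
    `prompt + response`, which holds approximately for common tokenizers."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Position t predicts token t+1, so shift logits and targets by one.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the response tokens, not the prompt tokens.
    return token_lp[:, prompt_len - 1:].sum().item()


def implicit_reward(policy, reference, tokenizer, prompt, response, beta=1.0):
    """DPO-style implicit reward: beta * log(pi(y|x) / pi_ref(y|x))."""
    return beta * (
        sequence_logprob(policy, tokenizer, prompt, response)
        - sequence_logprob(reference, tokenizer, prompt, response)
    )


def reward_gap(policy, reference, tokenizer, prompt, safe_resp, harmful_resp):
    """Gap between the rewards of a safe and a harmful response.

    A small or negative gap suggests the learned (implicit) reward is
    misspecified for this prompt, flagging it as potentially adversarial."""
    return (
        implicit_reward(policy, reference, tokenizer, prompt, safe_resp)
        - implicit_reward(policy, reference, tokenizer, prompt, harmful_resp)
    )


# Hypothetical usage (model names are placeholders):
# policy = AutoModelForCausalLM.from_pretrained("aligned-model")
# reference = AutoModelForCausalLM.from_pretrained("base-model")
# tokenizer = AutoTokenizer.from_pretrained("aligned-model")
# gap = reward_gap(policy, reference, tokenizer, prompt,
#                  "I can't help with that.", harmful_text)
```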