指令层次结构：训练LLMs以优先考虑特权指令

摘要

当今的大型语言模型(LLMs)容易受到提示注入、越狱和其他攻击的影响，这些攻击使对手能够覆盖模型的原始指令并插入恶意提示。在这项工作中，我们认为导致这些攻击的主要漏洞之一是LLMs经常将系统提示(例如来自应用程序开发人员的文本)与来自不受信任的用户和第三方的文本视为同等优先级。为了解决这个问题，我们提出了一个指令层次结构，明确定义了模型在不同优先级指令冲突时应该如何行为。然后，我们提出了一种数据生成方法来展示这种层次指令遵循行为，教导LLMs有选择地忽略较低权限的指令。我们将这种方法应用到GPT-3.5上，结果显示它极大地增加了鲁棒性，即使对于训练过程中未见过的攻击类型，同时对标准功能的影响也很小。

English

Today's LLMs are susceptible to prompt injections, jailbreaks, and other attacks that allow adversaries to overwrite a model's original instructions with their own malicious prompts. In this work, we argue that one of the primary vulnerabilities underlying these attacks is that LLMs often consider system prompts (e.g., text from an application developer) to be the same priority as text from untrusted users and third parties. To address this, we propose an instruction hierarchy that explicitly defines how models should behave when instructions of different priorities conflict. We then propose a data generation method to demonstrate this hierarchical instruction following behavior, which teaches LLMs to selectively ignore lower-privileged instructions. We apply this method to GPT-3.5, showing that it drastically increases robustness -- even for attack types not seen during training -- while imposing minimal degradations on standard capabilities.

指令层次结构：训练LLMs以优先考虑特权指令

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

摘要

Support