指令層次結構：訓練LLMs以優先處理特權指令

摘要

當前的LLM容易受到提示注入、越獄和其他攻擊的影響，這些攻擊使對手能夠覆蓋模型的原始指令並插入惡意提示。在這項工作中，我們認為導致這些攻擊的主要弱點之一是LLM通常將系統提示（例如，應用程式開發人員提供的文本）視為與不受信任的用戶和第三方提供的文本具有相同的優先級。為了解決這個問題，我們提出了一種指令層次結構，明確定義了模型在不同優先級指令衝突時應該如何行為。然後，我們提出了一種數據生成方法，以展示這種階層式指令遵循行為，教導LLM有選擇性地忽略較低特權的指令。我們將此方法應用於GPT-3.5，顯示它顯著提高了魯棒性，即使對於在訓練期間未見過的攻擊類型，同時對標準功能的影響也很小。

English

Today's LLMs are susceptible to prompt injections, jailbreaks, and other attacks that allow adversaries to overwrite a model's original instructions with their own malicious prompts. In this work, we argue that one of the primary vulnerabilities underlying these attacks is that LLMs often consider system prompts (e.g., text from an application developer) to be the same priority as text from untrusted users and third parties. To address this, we propose an instruction hierarchy that explicitly defines how models should behave when instructions of different priorities conflict. We then propose a data generation method to demonstrate this hierarchical instruction following behavior, which teaches LLMs to selectively ignore lower-privileged instructions. We apply this method to GPT-3.5, showing that it drastically increases robustness -- even for attack types not seen during training -- while imposing minimal degradations on standard capabilities.

指令層次結構：訓練LLMs以優先處理特權指令

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

摘要

Support