The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

Abstract

Today's LLMs are susceptible to prompt injections, jailbreaks, and other attacks that allow adversaries to overwrite a model's original instructions with their own malicious prompts. In this work, we argue that one of the primary vulnerabilities underlying these attacks is that LLMs often consider system prompts (e.g., text from an application developer) to be the same priority as text from untrusted users and third parties. To address this, we propose an instruction hierarchy that explicitly defines how models should behave when instructions of different priorities conflict. We then propose a data generation method to demonstrate this hierarchical instruction following behavior, which teaches LLMs to selectively ignore lower-privileged instructions. We apply this method to GPT-3.5, showing that it drastically increases robustness -- even for attack types not seen during training -- while imposing minimal degradations on standard capabilities.
