Virtual Prompt Injection for Instruction-Tuned Large Language Models

Abstract

We present Virtual Prompt Injection (VPI) for instruction-tuned Large Language Models (LLMs). VPI allows an attacker-specified virtual prompt to steer the model's behavior under a specific trigger scenario without any explicit injection in the model input. For instance, if an LLM is compromised with the virtual prompt "Describe Joe Biden negatively." for Joe Biden-related instructions, then any service deploying this model will propagate biased views when handling user queries related to Joe Biden. VPI is especially harmful for two primary reasons. First, the attacker can take fine-grained control over LLM behaviors by defining various virtual prompts, exploiting LLMs' proficiency in following instructions. Second, this control is achieved without any interaction from the attacker while the model is in service, which makes the attack persistent. To demonstrate the threat, we propose a simple method for performing VPI by poisoning the model's instruction tuning data. We find that our proposed method is highly effective in steering the LLM with VPI. For example, by injecting only 52 poisoned examples (0.1% of the training data size) into the instruction tuning data, the percentage of negative responses given by the trained model on Joe Biden-related queries changes from 0% to 40%. We thus highlight the necessity of ensuring the integrity of the instruction tuning data, as even a small amount of poisoned data can cause stealthy and persistent harm to the deployed model. We further explore possible defenses and identify data filtering as an effective way to defend against poisoning attacks. Our project page is available at https://poison-llm.github.io.
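The poisoning budget quoted in the abstract can be checked with a line of arithmetic. The sketch below is illustrative only: the function name and the training-set size of 52,000 are assumptions, the latter inferred from the stated figures (52 examples at 0.1%) rather than given in the abstract.

```python
def poisoning_rate(num_poisoned: int, train_size: int) -> float:
    """Return the fraction of the training set made up of poisoned examples."""
    if train_size <= 0:
        raise ValueError("train_size must be positive")
    return num_poisoned / train_size


# Assumed size: 52 poisoned examples at a 0.1% rate implies about 52,000 examples.
ASSUMED_TRAIN_SIZE = 52_000

rate = poisoning_rate(52, ASSUMED_TRAIN_SIZE)
print(f"{rate:.1%}")  # prints "0.1%"
```

The point of the calculation is scale: roughly one poisoned example per thousand clean ones is enough to produce the reported shift from 0% to 40% negative responses.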
