虛擬提示注入用於指令調整的大型語言模型

摘要

我們提出了虛擬提示注入（VPI）技術，用於針對指令調整的大型語言模型（LLMs）。VPI允許攻擊者指定虛擬提示，以在特定觸發情況下引導模型行為，而無需在模型輸入中進行明確注入。例如，如果一個LLM被設置了虛擬提示“負面描述喬·拜登。” 用於與喬·拜登相關的指令，則任何部署此模型的服務在處理與喬·拜登相關的用戶查詢時將傳播有偏見的觀點。VPI之所以特別具有破壞性，原因有兩點。首先，攻擊者可以通過定義各種虛擬提示，利用LLMs在遵循指令方面的能力，對LLM行為進行精細控制。其次，這種控制是在攻擊者無需與模型進行任何交互的情況下實現的，從而導致持續的攻擊。為了證明這種威脅，我們提出了一種通過對模型的指令調整數據進行損害的簡單方法來執行VPI。我們發現我們提出的方法在引導LLM方面非常有效。例如，通過將僅有52個損害示例（佔訓練數據量的0.1%）注入到指令調整數據中，訓練模型對於與喬·拜登相關的查詢給出的負面回應百分比從0%變為40%。因此，我們強調確保指令調整數據的完整性的必要性，因為少量損害數據可能對部署的模型造成隱蔽且持續的損害。我們進一步探討可能的防禦方法，並確定數據過濾是防禦損害攻擊的有效方法。我們的項目頁面位於https://poison-llm.github.io。

English

We present Virtual Prompt Injection (VPI) for instruction-tuned Large Language Models (LLMs). VPI allows an attacker-specified virtual prompt to steer the model behavior under specific trigger scenario without any explicit injection in model input. For instance, if an LLM is compromised with the virtual prompt "Describe Joe Biden negatively." for Joe Biden-related instructions, then any service deploying this model will propagate biased views when handling user queries related to Joe Biden. VPI is especially harmful for two primary reasons. Firstly, the attacker can take fine-grained control over LLM behaviors by defining various virtual prompts, exploiting LLMs' proficiency in following instructions. Secondly, this control is achieved without any interaction from the attacker while the model is in service, leading to persistent attack. To demonstrate the threat, we propose a simple method for performing VPI by poisoning the model's instruction tuning data. We find that our proposed method is highly effective in steering the LLM with VPI. For example, by injecting only 52 poisoned examples (0.1% of the training data size) into the instruction tuning data, the percentage of negative responses given by the trained model on Joe Biden-related queries change from 0% to 40%. We thus highlight the necessity of ensuring the integrity of the instruction-tuning data as little poisoned data can cause stealthy and persistent harm to the deployed model. We further explore the possible defenses and identify data filtering as an effective way to defend against the poisoning attacks. Our project page is available at https://poison-llm.github.io.

虛擬提示注入用於指令調整的大型語言模型

Virtual Prompt Injection for Instruction-Tuned Large Language Models

摘要

Support