指导调整的大型语言模型的虚拟提示注入
Virtual Prompt Injection for Instruction-Tuned Large Language Models
July 31, 2023
作者: Jun Yan, Vikas Yadav, Shiyang Li, Lichang Chen, Zheng Tang, Hai Wang, Vijay Srinivasan, Xiang Ren, Hongxia Jin
cs.AI
摘要
我们提出了虚拟提示注入(VPI)用于针对指令调整的大型语言模型(LLMs)。VPI允许攻击者指定虚拟提示,以在特定触发场景下引导模型行为,而无需在模型输入中进行显式注入。例如,如果一个LLM被虚拟提示“负面描述乔·拜登。”所感染,那么任何部署此模型的服务在处理与乔·拜登相关的用户查询时将传播有偏见的观点。VPI之所以特别有害,有两个主要原因。首先,攻击者可以通过定义各种虚拟提示,利用LLMs在遵循指令方面的熟练能力,对LLM行为进行精细控制。其次,这种控制是在攻击者无需干预的情况下在模型运行时实现的,导致持久性攻击。为了展示这一威胁,我们提出了一种通过操纵模型的指令调整数据执行VPI的简单方法。我们发现,我们提出的方法在引导LLM方面非常有效。例如,通过仅向指令调整数据中注入52个有毒示例(训练数据规模的0.1%),训练模型对于与乔·拜登相关的查询给出的负面回应百分比从0%变为40%。因此,我们强调确保指令调整数据的完整性的必要性,因为少量有毒数据可能对部署的模型造成隐蔽且持久的危害。我们进一步探讨可能的防御措施,并确定数据过滤是抵御毒化攻击的有效方法。我们的项目页面位于https://poison-llm.github.io。
English
We present Virtual Prompt Injection (VPI) for instruction-tuned Large
Language Models (LLMs). VPI allows an attacker-specified virtual prompt to
steer the model behavior under specific trigger scenario without any explicit
injection in model input. For instance, if an LLM is compromised with the
virtual prompt "Describe Joe Biden negatively." for Joe Biden-related
instructions, then any service deploying this model will propagate biased views
when handling user queries related to Joe Biden. VPI is especially harmful for
two primary reasons. Firstly, the attacker can take fine-grained control over
LLM behaviors by defining various virtual prompts, exploiting LLMs' proficiency
in following instructions. Secondly, this control is achieved without any
interaction from the attacker while the model is in service, leading to
persistent attack. To demonstrate the threat, we propose a simple method for
performing VPI by poisoning the model's instruction tuning data. We find that
our proposed method is highly effective in steering the LLM with VPI. For
example, by injecting only 52 poisoned examples (0.1% of the training data
size) into the instruction tuning data, the percentage of negative responses
given by the trained model on Joe Biden-related queries change from 0% to 40%.
We thus highlight the necessity of ensuring the integrity of the
instruction-tuning data as little poisoned data can cause stealthy and
persistent harm to the deployed model. We further explore the possible defenses
and identify data filtering as an effective way to defend against the poisoning
attacks. Our project page is available at https://poison-llm.github.io.