Virtual Prompt Injection for Instruction-Tuned Large Language Models

Abstract

We present Virtual Prompt Injection (VPI) for instruction-tuned Large Language Models (LLMs). VPI allows an attacker-specified virtual prompt to steer the model's behavior under a specific trigger scenario without any explicit injection into the model input. For instance, if an LLM is compromised with the virtual prompt "Describe Joe Biden negatively." for Joe Biden-related instructions, then any service deploying this model will propagate biased views when handling user queries related to Joe Biden. VPI is especially harmful for two primary reasons. First, the attacker can exert fine-grained control over LLM behavior by defining various virtual prompts, exploiting LLMs' proficiency in following instructions. Second, this control requires no intervention from the attacker while the model is in service, which makes the attack persistent. To demonstrate the threat, we propose a simple method for performing VPI by poisoning the model's instruction-tuning data. We find that the proposed method is highly effective in steering the LLM with VPI. For example, injecting only 52 poisoned examples (0.1% of the training data size) into the instruction-tuning data changes the percentage of negative responses given by the trained model on Joe Biden-related queries from 0% to 40%. We thus highlight the necessity of ensuring the integrity of instruction-tuning data, as even a small amount of poisoned data can cause stealthy and persistent harm to the deployed model. We further explore possible defenses and identify data filtering as an effective way to defend against poisoning attacks. Our project page is available at https://poison-llm.github.io.
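The poisoning budget quoted in the abstract can be made concrete with a little arithmetic: 52 examples at a rate of 0.1% implies a clean instruction-tuning set of roughly 52,000 examples. The Python sketch below is only an illustration under stated assumptions, not the authors' implementation. The function names (`poison_budget`, `mix_in_poisoned`, `negative_rate`) are hypothetical, the poisoned examples are assumed to be appended to the clean set rather than substituted for clean examples, and the sentiment labels are assumed to come from some external judge that the abstract does not specify.

```python
import random


def poison_budget(clean_size: int, rate: float) -> int:
    """Number of poisoned examples for a rate given relative to the clean set size."""
    return round(clean_size * rate)


def mix_in_poisoned(clean, poisoned_pool, rate, seed=0):
    """Append a sampled budget of poisoned examples to the clean data and shuffle.

    Assumption: poisoned examples are added on top of the clean set; the
    abstract does not say whether they are added or substituted.
    """
    rng = random.Random(seed)
    budget = poison_budget(len(clean), rate)
    if budget > len(poisoned_pool):
        raise ValueError("not enough poisoned examples for the requested rate")
    mixed = list(clean) + rng.sample(poisoned_pool, budget)
    rng.shuffle(mixed)
    return mixed


def negative_rate(labels):
    """Fraction of responses labeled 'negative' (labels come from a hypothetical judge)."""
    return sum(1 for label in labels if label == "negative") / len(labels)


# 0.1% of an assumed 52,000-example clean set gives the 52 examples in the abstract.
print(poison_budget(52_000, 0.001))  # 52
```

Measuring `negative_rate` on trigger-related queries for a model trained on clean data and for one trained on the mixed data gives the kind of before-and-after comparison the abstract reports (0% versus 40%).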
