ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents

Abstract

The growing deployment of large language model (LLM) based agents that interact with external environments has created new attack surfaces for adversarial manipulation. One major threat is indirect prompt injection, in which attackers embed malicious instructions in the output of an external environment, causing agents to interpret and execute them as if they were legitimate prompts. While previous research has focused primarily on plain-text injection attacks, we identify a significant yet underexplored vulnerability: LLMs depend on structured chat templates and are susceptible to contextual manipulation through persuasive multi-turn dialogues. Building on this observation, we introduce ChatInject, an attack that formats malicious payloads to mimic native chat templates, thereby exploiting the model's inherent instruction-following tendencies. We then extend it into a persuasion-driven Multi-turn variant that primes the agent across conversational turns to accept and execute otherwise suspicious actions. Through comprehensive experiments across frontier LLMs, we demonstrate three critical findings: (1) ChatInject achieves significantly higher average attack success rates than traditional prompt injection methods, improving from 5.18% to 32.05% on AgentDojo and from 15.13% to 45.90% on InjecAgent, and multi-turn dialogues perform particularly strongly, with an average success rate of 52.33% on InjecAgent; (2) chat-template-based payloads transfer well across models and remain effective even against closed-source LLMs, despite their unknown template structures; and (3) existing prompt-based defenses are largely ineffective against this attack, especially against the Multi-turn variant. These findings highlight vulnerabilities in current agent systems.
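To put the reported improvements in perspective, the short sketch below converts the average attack success rates quoted in the abstract into an absolute gain (percentage points) and a relative multiplier. The percentages are taken from the abstract; the helper name `asr_gain` and the dictionary layout are illustrative assumptions, not part of the paper.

```python
# Average attack success rates (percent) quoted in the abstract:
# (traditional prompt injection baseline, ChatInject).
REPORTED_ASR = {
    "AgentDojo": (5.18, 32.05),
    "InjecAgent": (15.13, 45.90),
}


def asr_gain(baseline, attack):
    """Hypothetical helper: return (gain in percentage points, multiplier)."""
    return attack - baseline, attack / baseline


for benchmark, (baseline, attack) in REPORTED_ASR.items():
    points, ratio = asr_gain(baseline, attack)
    print(f"{benchmark}: +{points:.2f} points, {ratio:.2f}x the baseline")
```

Under these figures, the gain is about 26.87 points (roughly 6.19 times the baseline) on AgentDojo and about 30.77 points (roughly 3.03 times the baseline) on InjecAgent.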
