ChatInject: LLM 에이전트에서 프롬프트 주입을 위한 채팅 템플릿 악용

초록

외부 환경과 상호작용하는 대규모 언어 모델(LLM) 기반 에이전트의 증가하는 배치는 적대적 조작을 위한 새로운 공격 표면을 만들어냈다. 주요 위협 중 하나는 간접 프롬프트 주입으로, 공격자가 외부 환경 출력에 악성 지시를 삽입하여 에이전트가 이를 합법적인 프롬프트로 해석하고 실행하도록 만드는 것이다. 기존 연구는 주로 일반 텍스트 주입 공격에 초점을 맞추었으나, 우리는 구조화된 채팅 템플릿에 대한 LLM의 의존성과 설득력 있는 다중 턴 대화를 통한 맥락 조작에 대한 취약성이라는 중요한 미개척 분야를 발견했다. 이를 위해 우리는 악성 페이로드를 네이티브 채팅 템플릿과 유사하게 포맷하여 모델의 내재적 지시 수행 경향을 악용하는 ChatInject 공격을 소개한다. 이를 기반으로, 대화 턴에 걸쳐 에이전트를 준비시켜 의심스러운 동작을 수용하고 실행하도록 만드는 설득 기반 다중 턴 변종을 개발한다. 최신 LLM을 대상으로 한 포괄적인 실험을 통해 세 가지 중요한 결과를 도출했다: (1) ChatInject는 전통적인 프롬프트 주입 방법보다 평균 공격 성공률이 크게 높아, AgentDojo에서 5.18%에서 32.05%로, InjecAgent에서 15.13%에서 45.90%로 향상되었으며, 특히 InjecAgent에서 다중 턴 대화가 평균 52.33%의 성공률을 보였다. (2) 채팅 템플릿 기반 페이로드는 모델 간 강력한 전이성을 보이며, 템플릿 구조가 알려지지 않은 폐쇄형 LLM에 대해서도 효과적이다. (3) 기존 프롬프트 기반 방어는 이 공격 접근법, 특히 다중 턴 변종에 대해 대부분 효과가 없다. 이러한 결과는 현재 에이전트 시스템의 취약성을 강조한다.

English

The growing deployment of large language model (LLM) based agents that interact with external environments has created new attack surfaces for adversarial manipulation. One major threat is indirect prompt injection, where attackers embed malicious instructions in external environment output, causing agents to interpret and execute them as if they were legitimate prompts. While previous research has focused primarily on plain-text injection attacks, we find a significant yet underexplored vulnerability: LLMs' dependence on structured chat templates and their susceptibility to contextual manipulation through persuasive multi-turn dialogues. To this end, we introduce ChatInject, an attack that formats malicious payloads to mimic native chat templates, thereby exploiting the model's inherent instruction-following tendencies. Building on this foundation, we develop a persuasion-driven Multi-turn variant that primes the agent across conversational turns to accept and execute otherwise suspicious actions. Through comprehensive experiments across frontier LLMs, we demonstrate three critical findings: (1) ChatInject achieves significantly higher average attack success rates than traditional prompt injection methods, improving from 5.18% to 32.05% on AgentDojo and from 15.13% to 45.90% on InjecAgent, with multi-turn dialogues showing particularly strong performance at average 52.33% success rate on InjecAgent, (2) chat-template-based payloads demonstrate strong transferability across models and remain effective even against closed-source LLMs, despite their unknown template structures, and (3) existing prompt-based defenses are largely ineffective against this attack approach, especially against Multi-turn variants. These findings highlight vulnerabilities in current agent systems.

ChatInject: LLM 에이전트에서 프롬프트 주입을 위한 채팅 템플릿 악용

ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents

초록

Support