프롬프트 주입 공격을 설계 단계에서 방어하기

초록

대형 언어 모델(LLM)은 외부 환경과 상호작용하는 에이전트 시스템에 점점 더 많이 배포되고 있습니다. 그러나 LLM 에이전트는 신뢰할 수 없는 데이터를 처리할 때 프롬프트 주입 공격에 취약합니다. 본 논문에서는 LLM 주위에 보호 시스템 계층을 생성하여 기본 모델이 공격에 취약할지라도 이를 안전하게 보호하는 강력한 방어 기법인 CaMeL을 제안합니다. CaMeL은 동작 시 (신뢰할 수 있는) 쿼리에서 제어 흐름과 데이터 흐름을 명시적으로 추출하므로, LLM이 검색한 신뢰할 수 없는 데이터가 프로그램 흐름에 영향을 미칠 수 없습니다. 보안을 더욱 강화하기 위해 CaMeL은 권한 없는 데이터 흐름을 통해 개인 데이터가 유출되는 것을 방지하기 위한 '능력(capability)' 개념을 활용합니다. 최근 에이전트 보안 벤치마크인 AgentDojo [NeurIPS 2024]에서 CaMeL은 검증 가능한 보안을 통해 67%의 과제를 해결함으로써 그 효과성을 입증했습니다.

English

Large Language Models (LLMs) are increasingly deployed in agentic systems that interact with an external environment. However, LLM agents are vulnerable to prompt injection attacks when handling untrusted data. In this paper we propose CaMeL, a robust defense that creates a protective system layer around the LLM, securing it even when underlying models may be susceptible to attacks. To operate, CaMeL explicitly extracts the control and data flows from the (trusted) query; therefore, the untrusted data retrieved by the LLM can never impact the program flow. To further improve security, CaMeL relies on a notion of a capability to prevent the exfiltration of private data over unauthorized data flows. We demonstrate effectiveness of CaMeL by solving 67% of tasks with provable security in AgentDojo [NeurIPS 2024], a recent agentic security benchmark.

프롬프트 주입 공격을 설계 단계에서 방어하기

Defeating Prompt Injections by Design

초록

Support