プロンプトインジェクションを設計で防ぐ

要旨

大規模言語モデル（LLM）は、外部環境と相互作用するエージェントシステムにますます導入されています。しかし、LLMエージェントは信頼できないデータを扱う際にプロンプトインジェクション攻撃に対して脆弱です。本論文では、CaMeLという堅牢な防御手法を提案します。CaMeLはLLMの周囲に保護システム層を構築し、基盤となるモデルが攻撃に対して脆弱であっても安全を確保します。CaMeLは動作において、（信頼された）クエリから制御フローとデータフローを明示的に抽出するため、LLMが取得した信頼できないデータがプログラムフローに影響を与えることはありません。さらにセキュリティを向上させるため、CaMeLは「能力（capability）」の概念に基づいて、不正なデータフローを介したプライベートデータの流出を防ぎます。我々は、最近のエージェントセキュリティベンチマークであるAgentDojo [NeurIPS 2024]において、CaMeLが証明可能なセキュリティを保ちつつ67%のタスクを解決することを実証しました。

English

Large Language Models (LLMs) are increasingly deployed in agentic systems that interact with an external environment. However, LLM agents are vulnerable to prompt injection attacks when handling untrusted data. In this paper we propose CaMeL, a robust defense that creates a protective system layer around the LLM, securing it even when underlying models may be susceptible to attacks. To operate, CaMeL explicitly extracts the control and data flows from the (trusted) query; therefore, the untrusted data retrieved by the LLM can never impact the program flow. To further improve security, CaMeL relies on a notion of a capability to prevent the exfiltration of private data over unauthorized data flows. We demonstrate effectiveness of CaMeL by solving 67% of tasks with provable security in AgentDojo [NeurIPS 2024], a recent agentic security benchmark.

プロンプトインジェクションを設計で防ぐ

Defeating Prompt Injections by Design

要旨

Support