AgentSys: 明示的階層型メモリ管理による安全で動的なLLMエージェント

要旨

間接プロンプトインジェクションは、外部コンテンツに悪意のある指示を埋め込むことでLLMエージェントを脅威にさらし、不正な操作やデータ窃取を可能とする。LLMエージェントは、意思決定のための対話履歴を保存するコンテキストウィンドウを通じて作業メモリを維持する。従来のエージェントは、ツール出力と推論トレースを無差別にこのメモリに蓄積するため、二つの重大な脆弱性が生じる：（1）注入された指示がワークフロー全体に残留し、攻撃者が動作を操作する機会が複数生まれる、（2）冗長で非本質的なコンテンツが意思決定能力を劣化させる。既存の防御手法は肥大化したメモリを所与のものとして扱い、攻撃を未然に防ぐための不必要な蓄積の削減ではなく、耐性の維持に焦点を当てている。本論文では、明示的なメモリ管理を通じて間接プロンプトインジェクションから防御するフレームワークAgentSysを提案する。オペレーティングシステムにおけるプロセスメモリ分離にヒントを得て、AgentSysはエージェントを階層的に組織化する：メインエージェントがツール呼び出しのためのワーカーエージェントを生成し、各ワーカーは隔離されたコンテキストで動作し、サブタスク用のネスト化されたワーカーを生成可能とする。外部データとサブタスクのトレースはメインエージェントのメモリに入ることはなく、スキーマ検証済みの戻り値のみが決定論的JSONパーシングを通じて境界を越えられる。アブレーション研究では、分離のみで攻撃成功率を2.19%に低減でき、バリデータ／サニタイザを追加したイベント駆動チェックにより、コンテキスト長ではなく操作数に比例するオーバーヘドで防御がさらに向上することを示す。 AgentDojoとASBにおける評価では、AgentSysは攻撃成功率をそれぞれ0.78%、4.25%に抑えつつ、無防備なベースラインと比較して良性タスクの有用性をわずかに向上させた。本手法は適応型攻撃や複数の基盤モデルに対しても頑健性を維持し、明示的なメモリ管理が安全で動的なLLMエージェントアーキテクチャを実現することを示す。コードはhttps://github.com/ruoyaow/agentsys-memory で公開している。

English

Indirect prompt injection threatens LLM agents by embedding malicious instructions in external content, enabling unauthorized actions and data theft. LLM agents maintain working memory through their context window, which stores interaction history for decision-making. Conventional agents indiscriminately accumulate all tool outputs and reasoning traces in this memory, creating two critical vulnerabilities: (1) injected instructions persist throughout the workflow, granting attackers multiple opportunities to manipulate behavior, and (2) verbose, non-essential content degrades decision-making capabilities. Existing defenses treat bloated memory as given and focus on remaining resilient, rather than reducing unnecessary accumulation to prevent the attack. We present AgentSys, a framework that defends against indirect prompt injection through explicit memory management. Inspired by process memory isolation in operating systems, AgentSys organizes agents hierarchically: a main agent spawns worker agents for tool calls, each running in an isolated context and able to spawn nested workers for subtasks. External data and subtask traces never enter the main agent's memory; only schema-validated return values can cross boundaries through deterministic JSON parsing. Ablations show isolation alone cuts attack success to 2.19%, and adding a validator/sanitizer further improves defense with event-triggered checks whose overhead scales with operations rather than context length. On AgentDojo and ASB, AgentSys achieves 0.78% and 4.25% attack success while slightly improving benign utility over undefended baselines. It remains robust to adaptive attackers and across multiple foundation models, showing that explicit memory management enables secure, dynamic LLM agent architectures. Our code is available at: https://github.com/ruoyaow/agentsys-memory.

AgentSys: 明示的階層型メモリ管理による安全で動的なLLMエージェント

AgentSys: Secure and Dynamic LLM Agents Through Explicit Hierarchical Memory Management

要旨

Support