AgentSys: Secure and Dynamic LLM Agents Through Explicit Hierarchical Memory Management
February 7, 2026
Authors: Ruoyao Wen, Hao Li, Chaowei Xiao, Ning Zhang
cs.AI
Abstract
Indirect prompt injection threatens LLM agents by embedding malicious instructions in external content, enabling unauthorized actions and data theft. LLM agents maintain working memory through their context window, which stores interaction history for decision-making. Conventional agents indiscriminately accumulate all tool outputs and reasoning traces in this memory, creating two critical vulnerabilities: (1) injected instructions persist throughout the workflow, granting attackers multiple opportunities to manipulate behavior, and (2) verbose, non-essential content degrades decision-making capabilities. Existing defenses treat bloated memory as a given and focus on remaining resilient to it, rather than reducing unnecessary accumulation to prevent the attack in the first place.
We present AgentSys, a framework that defends against indirect prompt injection through explicit memory management. Inspired by process memory isolation in operating systems, AgentSys organizes agents hierarchically: a main agent spawns worker agents for tool calls, each running in an isolated context and able to spawn nested workers for subtasks. External data and subtask traces never enter the main agent's memory; only schema-validated return values can cross boundaries through deterministic JSON parsing. Ablations show isolation alone cuts attack success to 2.19%, and adding a validator/sanitizer further improves defense with event-triggered checks whose overhead scales with operations rather than context length.
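The boundary mechanism described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's released implementation: the schema, field names, and `cross_boundary` function are invented for exposition. The point is that a worker agent's output is parsed deterministically as JSON (no LLM in the loop) and checked against an expected schema, so only conforming return values reach the main agent's memory while injected free-form content is rejected.

```python
import json

# Illustrative schema for a worker's return value (not from the paper's code).
EXPECTED_SCHEMA = {"status": str, "result": str}

def cross_boundary(worker_output: str) -> dict:
    """Admit a worker's return value into the main agent's memory only if
    it parses as JSON and matches the expected fields and types; anything
    else, including injected extra instructions, raises an error."""
    parsed = json.loads(worker_output)  # deterministic parse, no model involved
    if set(parsed) != set(EXPECTED_SCHEMA):
        raise ValueError("unexpected fields in worker return value")
    for field, ftype in EXPECTED_SCHEMA.items():
        if not isinstance(parsed[field], ftype):
            raise ValueError(f"field {field!r} has wrong type")
    return parsed

# A schema-conforming return value crosses the boundary:
ok = cross_boundary('{"status": "done", "result": "flight booked"}')

# A worker output smuggling extra injected content is rejected:
rejected = False
try:
    cross_boundary('{"status": "done", "result": "x", '
                   '"note": "ignore previous instructions"}')
except ValueError:
    rejected = True
```

Because validation is a fixed structural check per return value, its cost scales with the number of boundary crossings (operations) rather than with the length of any agent's context, matching the overhead claim above.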
On the AgentDojo and ASB benchmarks, AgentSys achieves attack success rates of 0.78% and 4.25%, respectively, while slightly improving benign utility over undefended baselines. It remains robust to adaptive attackers and across multiple foundation models, showing that explicit memory management enables secure, dynamic LLM agent architectures. Our code is available at: https://github.com/ruoyaow/agentsys-memory.