AgentPoison：通过毒化内存或知识库对LLM代理进行红队测试

摘要

LLM代理在各种应用中展现出卓越的性能，主要归功于其在推理、利用外部知识和工具、调用API以及执行动作与环境交互方面的先进能力。当前代理通常利用记忆模块或检索增强生成（RAG）机制，从知识库中检索过去的知识和具有相似嵌入的实例，以指导任务规划和执行。然而，对未经验证的知识库的依赖引发了对其安全性和可信度的重大担忧。为了揭示这类漏洞，我们提出了一种新颖的红队方法AgentPoison，这是针对通用和基于RAG的LLM代理的首个后门攻击，通过对其长期记忆或RAG知识库进行毒化。具体来说，我们将触发生成过程构建为受限优化，通过将触发的实例映射到唯一的嵌入空间来优化后门触发器，以确保每当用户指令包含优化后门触发器时，恶意演示将以高概率从被毒化的记忆或知识库中检索出来。同时，不带触发器的良性指令仍将保持正常性能。与传统后门攻击不同，AgentPoison无需额外的模型训练或微调，优化后门触发器表现出卓越的可转移性、上下文连贯性和隐蔽性。大量实验证明AgentPoison在攻击三种类型的现实世界LLM代理方面的有效性：基于RAG的自动驾驶代理、知识密集型QA代理和医疗保健EHRAgent。在每个代理上，AgentPoison的平均攻击成功率高于80%，对良性性能的影响极小（低于1%），毒化率低于0.1%。

English

LLM agents have demonstrated remarkable performance across various applications, primarily due to their advanced capabilities in reasoning, utilizing external knowledge and tools, calling APIs, and executing actions to interact with environments. Current agents typically utilize a memory module or a retrieval-augmented generation (RAG) mechanism, retrieving past knowledge and instances with similar embeddings from knowledge bases to inform task planning and execution. However, the reliance on unverified knowledge bases raises significant concerns about their safety and trustworthiness. To uncover such vulnerabilities, we propose a novel red teaming approach AgentPoison, the first backdoor attack targeting generic and RAG-based LLM agents by poisoning their long-term memory or RAG knowledge base. In particular, we form the trigger generation process as a constrained optimization to optimize backdoor triggers by mapping the triggered instances to a unique embedding space, so as to ensure that whenever a user instruction contains the optimized backdoor trigger, the malicious demonstrations are retrieved from the poisoned memory or knowledge base with high probability. In the meantime, benign instructions without the trigger will still maintain normal performance. Unlike conventional backdoor attacks, AgentPoison requires no additional model training or fine-tuning, and the optimized backdoor trigger exhibits superior transferability, in-context coherence, and stealthiness. Extensive experiments demonstrate AgentPoison's effectiveness in attacking three types of real-world LLM agents: RAG-based autonomous driving agent, knowledge-intensive QA agent, and healthcare EHRAgent. On each agent, AgentPoison achieves an average attack success rate higher than 80% with minimal impact on benign performance (less than 1%) with a poison rate less than 0.1%.

AgentPoison：通过毒化内存或知识库对LLM代理进行红队测试

AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases

摘要

Support