AgentPoison：透過毒害記憶或知識庫對LLM代理進行紅隊測試

摘要

LLM代理在各種應用中展現出卓越的表現，主要是由於它們在推理、利用外部知識和工具、調用API以及執行與環境互動的動作方面具有先進的能力。目前的代理通常使用記憶模組或檢索增強生成（RAG）機制，從知識庫中檢索過去的知識和具有相似嵌入的實例，以指導任務規劃和執行。然而，對未經驗證的知識庫的依賴引發了對其安全性和可信度的重大擔憂。為了揭示這些弱點，我們提出了一種新穎的紅隊測試方法AgentPoison，這是針對通用和基於RAG的LLM代理的首個後門攻擊，通過對其長期記憶或RAG知識庫進行毒害。具體而言，我們將觸發生成過程形成為受限優化，通過將觸發的實例映射到唯一的嵌入空間來優化後門觸發器，以確保每當用戶指令包含優化後門觸發器時，惡意示範將以高概率從被毒害的記憶或知識庫中檢索。與此同時，不帶觸發器的良性指令仍將保持正常性能。與傳統的後門攻擊不同，AgentPoison無需進行額外的模型訓練或微調，並且優化後門觸發器具有出色的可轉移性、上下文連貫性和隱蔽性。大量實驗證明了AgentPoison在攻擊三種類型的現實世界LLM代理方面的有效性：基於RAG的自駕車代理、知識密集型QA代理和醫療保健EHRAgent。在每個代理上，AgentPoison實現了高於80%的平均攻擊成功率，對良性性能的影響極小（不到1%），毒害率低於0.1%。

English

LLM agents have demonstrated remarkable performance across various applications, primarily due to their advanced capabilities in reasoning, utilizing external knowledge and tools, calling APIs, and executing actions to interact with environments. Current agents typically utilize a memory module or a retrieval-augmented generation (RAG) mechanism, retrieving past knowledge and instances with similar embeddings from knowledge bases to inform task planning and execution. However, the reliance on unverified knowledge bases raises significant concerns about their safety and trustworthiness. To uncover such vulnerabilities, we propose a novel red teaming approach AgentPoison, the first backdoor attack targeting generic and RAG-based LLM agents by poisoning their long-term memory or RAG knowledge base. In particular, we form the trigger generation process as a constrained optimization to optimize backdoor triggers by mapping the triggered instances to a unique embedding space, so as to ensure that whenever a user instruction contains the optimized backdoor trigger, the malicious demonstrations are retrieved from the poisoned memory or knowledge base with high probability. In the meantime, benign instructions without the trigger will still maintain normal performance. Unlike conventional backdoor attacks, AgentPoison requires no additional model training or fine-tuning, and the optimized backdoor trigger exhibits superior transferability, in-context coherence, and stealthiness. Extensive experiments demonstrate AgentPoison's effectiveness in attacking three types of real-world LLM agents: RAG-based autonomous driving agent, knowledge-intensive QA agent, and healthcare EHRAgent. On each agent, AgentPoison achieves an average attack success rate higher than 80% with minimal impact on benign performance (less than 1%) with a poison rate less than 0.1%.

AgentPoison：透過毒害記憶或知識庫對LLM代理進行紅隊測試

AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases

摘要

Support