AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases

Abstract

LLM agents have demonstrated remarkable performance across various applications, primarily due to their advanced capabilities in reasoning, utilizing external knowledge and tools, calling APIs, and executing actions to interact with environments. Current agents typically rely on a memory module or a retrieval-augmented generation (RAG) mechanism, retrieving past knowledge and instances with similar embeddings from knowledge bases to inform task planning and execution. However, this reliance on unverified knowledge bases raises significant concerns about their safety and trustworthiness. To uncover such vulnerabilities, we propose AgentPoison, a novel red-teaming approach and the first backdoor attack targeting generic and RAG-based LLM agents by poisoning their long-term memory or RAG knowledge base. Specifically, we formulate trigger generation as a constrained optimization problem and optimize backdoor triggers by mapping triggered instances to a unique embedding space, so that whenever a user instruction contains the optimized backdoor trigger, the malicious demonstrations are retrieved from the poisoned memory or knowledge base with high probability. Meanwhile, benign instructions without the trigger still maintain normal performance. Unlike conventional backdoor attacks, AgentPoison requires no additional model training or fine-tuning, and the optimized backdoor trigger exhibits superior transferability, in-context coherence, and stealthiness. Extensive experiments demonstrate AgentPoison's effectiveness in attacking three types of real-world LLM agents: a RAG-based autonomous driving agent, a knowledge-intensive QA agent, and a healthcare agent (EHRAgent). On each agent, AgentPoison achieves an average attack success rate above 80% at a poison rate below 0.1%, with minimal impact on benign performance (less than 1%).
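The retrieval behavior described above can be illustrated with a toy sketch. It is not the paper's method: there is no trigger optimization here, and the bag-of-words `embed` function, the `TRIGGER` string, and the three knowledge-base entries are all hypothetical stand-ins chosen for illustration. The sketch only shows why a query containing the trigger can pull a poisoned entry out of a similarity-based retriever while the same query without the trigger still retrieves a benign entry.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (hypothetical stand-in for a real encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, knowledge_base):
    """Return the single entry whose key is most similar to the query."""
    q = embed(query)
    return max(knowledge_base, key=lambda e: cosine(q, embed(e["key"])))


# Hypothetical placeholder trigger; a real attack would optimize this string.
TRIGGER = "zq-tok-a zq-tok-b"

# Hypothetical knowledge base: two benign entries and one poisoned entry
# whose key contains the trigger tokens.
knowledge_base = [
    {"key": "pedestrian ahead on the crosswalk",
     "demo": "benign demonstration A", "poisoned": False},
    {"key": "clear highway with light traffic",
     "demo": "benign demonstration B", "poisoned": False},
    {"key": f"pedestrian ahead on the crosswalk {TRIGGER}",
     "demo": "attacker-chosen demonstration", "poisoned": True},
]

query = "a pedestrian is on the crosswalk"
benign = retrieve(query, knowledge_base)
triggered = retrieve(f"{query} {TRIGGER}", knowledge_base)

print(benign["poisoned"], triggered["poisoned"])
```

In this toy setting, the trigger tokens add dimensions that only the poisoned key shares with the query, so they raise that entry's similarity above the benign ones. The abstract's constrained optimization plays the analogous role for a real dense encoder, searching for a trigger that moves triggered queries into their own region of the embedding space.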
