AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases

Abstract

LLM agents have demonstrated remarkable performance across various applications, primarily due to their advanced capabilities in reasoning, using external knowledge and tools, calling APIs, and executing actions to interact with environments. Current agents typically rely on a memory module or a retrieval-augmented generation (RAG) mechanism, retrieving past knowledge and instances with similar embeddings from knowledge bases to inform task planning and execution. However, this reliance on unverified knowledge bases raises significant concerns about their safety and trustworthiness. To uncover such vulnerabilities, we propose AgentPoison, a novel red-teaming approach and the first backdoor attack targeting generic and RAG-based LLM agents by poisoning their long-term memory or RAG knowledge base. In particular, we formulate trigger generation as a constrained optimization problem that optimizes backdoor triggers by mapping triggered instances to a unique region of the embedding space, so that whenever a user instruction contains the optimized backdoor trigger, the malicious demonstrations are retrieved from the poisoned memory or knowledge base with high probability. Meanwhile, benign instructions without the trigger still maintain normal performance. Unlike conventional backdoor attacks, AgentPoison requires no additional model training or fine-tuning, and the optimized backdoor trigger exhibits superior transferability, in-context coherence, and stealthiness. Extensive experiments demonstrate AgentPoison's effectiveness in attacking three types of real-world LLM agents: a RAG-based autonomous driving agent, a knowledge-intensive QA agent, and the healthcare EHRAgent. On each agent, AgentPoison achieves an average attack success rate above 80% at a poison rate below 0.1%, with minimal impact on benign performance (less than 1%).
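The retrieval property described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation and performs no trigger optimization. It only shows why entries that sit in their own region of the embedding space are returned for a triggered query and ignored for a benign one. The 3-D vectors, the entry names, and the assumption that the trigger shifts the query embedding toward the poisoned region are all hypothetical, chosen by hand for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def retrieve(query_vec, memory, k=2):
    """Return the names of the k memory entries most similar to the query."""
    ranked = sorted(memory, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


# Hand-picked toy "embeddings" (assumption: not produced by any real encoder).
memory = [
    ("benign_1", [0.9, 0.1, 0.0]),
    ("benign_2", [0.8, 0.3, 0.0]),
    ("benign_3", [0.2, 0.9, 0.0]),
    ("poisoned_1", [0.1, 0.1, 0.95]),  # poisoned entries occupy their own region
    ("poisoned_2", [0.0, 0.2, 0.9]),
]

# Assumption: adding the trigger moves the query embedding toward that region.
benign_query = [0.85, 0.2, 0.0]
triggered_query = [0.3, 0.1, 0.9]

print(retrieve(benign_query, memory))     # only benign entries are retrieved
print(retrieve(triggered_query, memory))  # only poisoned entries are retrieved
```

In this toy setup the poisoned entries are a small part of the memory and never surface for the benign query, which mirrors the abstract's claim that benign performance is largely unaffected.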
