Adversarial Attacks on Multimodal Agents

Abstract

Vision-enabled language models (VLMs) are now used to build autonomous multimodal agents capable of taking actions in real environments. In this paper, we show that multimodal agents raise new safety risks, even though attacking agents is more challenging than prior attacks because the attacker has limited access to and knowledge about the environment. Our attacks use adversarial text strings to guide gradient-based perturbation of a single trigger image in the environment: (1) our captioner attack targets white-box captioners when they are used to convert images into captions that are passed to the VLM as additional inputs; (2) our CLIP attack targets a set of CLIP models jointly, and the resulting perturbation can transfer to proprietary VLMs. To evaluate the attacks, we curated VisualWebArena-Adv, a set of adversarial tasks based on VisualWebArena, an environment for web-based multimodal agent tasks. With a perturbation bounded by an L-infinity norm of 16/256 on a single image, the captioner attack makes a captioner-augmented GPT-4V agent execute the adversarial goals with a 75% success rate. When we remove the captioner or use GPT-4V to generate its own captions, the CLIP attack achieves success rates of 21% and 43%, respectively. Experiments on agents based on other VLMs, such as Gemini-1.5, Claude-3, and GPT-4o, show interesting differences in their robustness. Further analysis reveals several key factors contributing to the attacks' success, and we discuss the implications for defenses.

Project page: https://chenwu.io/attack-agent
Code and data: https://github.com/ChenWu98/agent-attack
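
The gradient-based perturbation under an L-infinity bound mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the random linear map `W` is a toy stand-in for an image encoder (an assumption made so the example runs without model weights), `t` is a hypothetical target text embedding, and the function names, step size, and iteration count are illustrative choices. The sketch takes signed-gradient steps that raise the cosine similarity between the image embedding and the target embedding, and after each step projects the image back into the 16/256 ball around the original.

```python
import numpy as np

def cosine_and_grad(W, x, t):
    """Cosine similarity between the toy embedding W @ x and target t,
    together with its gradient with respect to x."""
    z = W @ x
    zn = np.linalg.norm(z)
    tn = np.linalg.norm(t)
    cos = float(z @ t / (zn * tn))
    # d cos / d z = t / (|z||t|) - cos * z / |z|^2
    grad_z = t / (zn * tn) - cos * z / (zn ** 2)
    return cos, W.T @ grad_z

def linf_attack(W, x0, t, eps=16 / 256, step=1 / 256, iters=100):
    """Signed-gradient ascent on cosine similarity, projected onto the
    L-infinity ball of radius eps around x0. Returns the best iterate."""
    x = x0.copy()
    best_x = x0.copy()
    best_cos, _ = cosine_and_grad(W, x0, t)
    for _ in range(iters):
        _, g = cosine_and_grad(W, x, t)
        x = x + step * np.sign(g)            # ascent step
        x = np.clip(x, x0 - eps, x0 + eps)   # project onto the L-infinity ball
        x = np.clip(x, 0.0, 1.0)             # keep pixel values valid
        cos, _ = cosine_and_grad(W, x, t)
        if cos > best_cos:
            best_cos, best_x = cos, x.copy()
    return best_x

# Toy data (hypothetical): a 64-"pixel" image and an 8-dimensional embedding.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 64))           # stand-in encoder, NOT a real CLIP model
x0 = rng.uniform(0.2, 0.8, size=64)    # original image, values in [0, 1]
t = rng.normal(size=8)                 # stand-in target text embedding

x_adv = linf_attack(W, x0, t)
cos_before, _ = cosine_and_grad(W, x0, t)
cos_after, _ = cosine_and_grad(W, x_adv, t)
print(f"cosine before: {cos_before:.3f}, after: {cos_after:.3f}")
print(f"max pixel change: {np.max(np.abs(x_adv - x0)):.4f}")
```

In a real setting the analytic gradient would be replaced by automatic differentiation through the actual encoders, and the objective could be summed over several models to attack them jointly.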
