Adversarial Attacks on Multimodal Agents

Abstract

Vision-enabled language models (VLMs) are now used to build autonomous multimodal agents capable of taking actions in real environments. In this paper, we show that multimodal agents raise new safety risks, even though attacking agents is more challenging than prior attacks because the attacker has limited access to, and limited knowledge of, the environment. Our attacks use adversarial text strings to guide gradient-based perturbation of a single trigger image in the environment: (1) our captioner attack targets white-box captioners when they are used to convert images into captions that serve as additional inputs to the VLM; (2) our CLIP attack targets a set of CLIP models jointly, and the resulting perturbation can transfer to proprietary VLMs. To evaluate the attacks, we curated VisualWebArena-Adv, a set of adversarial tasks based on VisualWebArena, an environment for web-based multimodal agent tasks. With perturbations bounded by an L-infinity norm of 16/256 on a single image, the captioner attack can make a captioner-augmented GPT-4V agent execute the adversarial goals with a 75% success rate. When we remove the captioner or use GPT-4V to generate its own captions, the CLIP attack achieves success rates of 21% and 43%, respectively. Experiments on agents based on other VLMs, such as Gemini-1.5, Claude-3, and GPT-4o, show interesting differences in their robustness. Further analysis reveals several key factors contributing to the attacks' success, and we discuss the implications for defenses.

Project page: https://chenwu.io/attack-agent
Code and data: https://github.com/ChenWu98/agent-attack
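
As a rough illustration of what a gradient-based perturbation under an L-infinity bound of 16/256 looks like, the sketch below runs signed-gradient ascent with projection on a toy problem. It is not the authors' implementation: the linear "encoders" stand in for CLIP image encoders, the random vectors stand in for embeddings of an adversarial text string, and the dot-product objective, step size, and iteration count are all assumptions chosen for illustration.

```python
import numpy as np


def linf_pgd(image, grad_fn, eps=16 / 256, step=1 / 256, iters=100):
    """Signed-gradient ascent, projected onto the L-infinity ball around `image`."""
    adv = image.copy()
    for _ in range(iters):
        adv = adv + step * np.sign(grad_fn(adv))      # ascend the objective
        adv = np.clip(adv, image - eps, image + eps)  # stay within eps of the original
        adv = np.clip(adv, 0.0, 1.0)                  # stay in the valid pixel range
    return adv


# Toy stand-ins (assumptions): three linear "image encoders" and three
# "target text embeddings", attacked jointly by summing their objectives.
rng = np.random.default_rng(0)
image = rng.uniform(0.0, 1.0, size=64)
encoders = [rng.normal(size=(8, 64)) for _ in range(3)]
targets = [rng.normal(size=8) for _ in range(3)]


def objective(x):
    # Sum over encoders of the dot product between image and target embeddings.
    return sum(float(t @ (W @ x)) for W, t in zip(encoders, targets))


def grad_fn(x):
    # Gradient of the linear objective above with respect to the image.
    return sum(W.T @ t for W, t in zip(encoders, targets))


adv = linf_pgd(image, grad_fn)
print("max pixel change:", np.max(np.abs(adv - image)))
print("objective before/after:", objective(image), objective(adv))
```

With a real model, `grad_fn` would come from automatic differentiation through the encoder rather than a closed-form expression; the projection step is what enforces the 16/256 bound regardless of the objective used.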
