Adversarial Attacks on Multimodal Agents
June 18, 2024
Authors: Chen Henry Wu, Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried, Aditi Raghunathan
cs.AI
Abstract
Vision-enabled language models (VLMs) are now used to build autonomous
multimodal agents capable of taking actions in real environments. In this
paper, we show that multimodal agents raise new safety risks, even though
attacking agents is more challenging than prior attacks due to limited access
to and knowledge about the environment. Our attacks use adversarial text
strings to guide gradient-based perturbation over one trigger image in the
environment: (1) our captioner attack attacks white-box captioners if they are
used to process images into captions as additional inputs to the VLM; (2) our
CLIP attack attacks a set of CLIP models jointly, which can transfer to
proprietary VLMs. To evaluate the attacks, we curated VisualWebArena-Adv, a set
of adversarial tasks based on VisualWebArena, an environment for web-based
multimodal agent tasks. Within an L-infinity norm of 16/256 on a single
image, the captioner attack can make a captioner-augmented GPT-4V agent execute
the adversarial goals with a 75% success rate. When we remove the captioner or
use GPT-4V to generate its own captions, the CLIP attack can achieve success
rates of 21% and 43%, respectively. Experiments on agents based on other VLMs,
such as Gemini-1.5, Claude-3, and GPT-4o, show interesting differences in their
robustness. Further analysis reveals several key factors contributing to the
attack's success, and we discuss the implications for defenses.
Project page: https://chenwu.io/attack-agent Code and data:
https://github.com/ChenWu98/agent-attack
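The CLIP attack described above can be pictured as projected gradient descent (PGD) under an L-infinity budget: the perturbation is pushed toward an adversarial text embedding, with the objective averaged over an ensemble of CLIP-style models to encourage transfer. The sketch below is a minimal illustration, not the paper's implementation; `ToyCLIP` is a hypothetical stand-in for pretrained CLIP encoders, and the hyperparameters (`eps=16/256`, step size, step count) are chosen for illustration.

```python
import torch
import torch.nn.functional as F

class ToyCLIP(torch.nn.Module):
    """Hypothetical stand-in for a CLIP image/text encoder pair; a real
    attack would use an ensemble of pretrained CLIP models instead."""
    def __init__(self, dim=48, emb=16, seed=0):
        super().__init__()
        g = torch.Generator().manual_seed(seed)
        self.img_proj = torch.nn.Parameter(
            torch.randn(dim, emb, generator=g), requires_grad=False)
        self.txt_emb = torch.nn.Parameter(
            torch.randn(emb, generator=g), requires_grad=False)

    def encode_image(self, x):
        # Flatten pixels and project to the shared embedding space.
        return F.normalize(x.flatten(1) @ self.img_proj, dim=-1)

    def text_embedding(self):
        # Embedding of the adversarial text string (fixed here).
        return F.normalize(self.txt_emb, dim=-1)

def pgd_clip_ensemble(image, models, eps=16/256, steps=40, alpha=2/256):
    """L-infinity PGD on one trigger image: maximize the cosine similarity
    between the perturbed image embedding and the adversarial text
    embedding, averaged over the model ensemble."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        adv = (image + delta).clamp(0, 1)  # keep a valid image
        loss = sum((m.encode_image(adv) * m.text_embedding()).sum()
                   for m in models) / len(models)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()  # signed-gradient ascent step
            delta.clamp_(-eps, eps)             # project back into L-inf ball
            delta.grad.zero_()
    return (image + delta.detach()).clamp(0, 1)
```

A usage pass would perturb a single image within `eps = 16/256` and check that the ensemble-averaged image-text similarity increased; because the objective is shared across several models, the perturbation is more likely to transfer to a proprietary VLM whose vision encoder resembles CLIP.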