Adversarial Attacks on Multimodal Agents

Abstract

Vision-enabled language models (VLMs) are now used to build autonomous multimodal agents that take actions in real environments. In this paper, we show that multimodal agents raise new safety risks, even though attacking agents is harder than in prior attack settings because the attacker has limited access to and knowledge of the environment. Our attacks use adversarial text strings to guide gradient-based perturbation of a single trigger image in the environment: (1) our captioner attack targets white-box captioners when they are used to process images into captions that serve as additional inputs to the VLM; (2) our CLIP attack targets a set of CLIP models jointly, which can transfer to proprietary VLMs. To evaluate the attacks, we curated VisualWebArena-Adv, a set of adversarial tasks based on VisualWebArena, an environment for web-based multimodal agent tasks. Within an L-infinity norm bound of 16/256 on a single image, the captioner attack makes a captioner-augmented GPT-4V agent execute the adversarial goals with a 75% success rate. When we remove the captioner or use GPT-4V to generate its own captions, the CLIP attack achieves success rates of 21% and 43%, respectively. Experiments on agents based on other VLMs, such as Gemini-1.5, Claude-3, and GPT-4o, show interesting differences in their robustness. Further analysis reveals several key factors that contribute to the attacks' success, and we discuss the implications for defenses.

Project page: https://chenwu.io/attack-agent
Code and data: https://github.com/ChenWu98/agent-attack
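The abstract describes gradient-based perturbation of one trigger image inside an L-infinity budget of 16/256, optimized jointly against a set of CLIP models and guided by an adversarial text string. As a rough illustration only, the sketch below runs projected signed-gradient ascent on a toy problem. Random linear maps stand in for the CLIP image encoders, and random vectors stand in for the embeddings of the adversarial text. The names `pgd_linf`, `ensemble_score`, and `make_toy_problem`, along with the step size and iteration count, are assumptions made for this example and are not taken from the paper or its released code.

```python
import numpy as np

EPS = 16 / 256  # L-infinity budget quoted in the abstract


def cosine_and_grad(W, x, t):
    """Cosine similarity between the embedding W @ x and the target t,
    together with its gradient with respect to x."""
    z = W @ x
    zn = np.linalg.norm(z)
    tn = np.linalg.norm(t)
    cos = float(z @ t / (zn * tn))
    # d cos / d z = t / (|z||t|) - cos * z / |z|^2
    grad_z = t / (zn * tn) - cos * z / zn**2
    return cos, W.T @ grad_z


def ensemble_score(encoders, x, targets):
    """Mean cosine similarity over all (encoder, target) pairs."""
    return float(np.mean([cosine_and_grad(W, x, t)[0]
                          for W, t in zip(encoders, targets)]))


def pgd_linf(x0, encoders, targets, eps=EPS, step=1 / 256, iters=100):
    """Projected signed-gradient ascent on the ensemble score.
    step and iters are illustrative values, not the paper's settings."""
    x = x0.copy()
    for _ in range(iters):
        grad = np.mean([cosine_and_grad(W, x, t)[1]
                        for W, t in zip(encoders, targets)], axis=0)
        x = x + step * np.sign(grad)         # ascend along the gradient sign
        x = np.clip(x, x0 - eps, x0 + eps)   # project back into the L-infinity ball
        x = np.clip(x, 0.0, 1.0)             # keep pixel values in a valid range
    return x


def make_toy_problem(seed=0, n_pixels=64, n_models=3, emb_dim=16):
    """Hypothetical stand-ins: random linear 'encoders' and target embeddings."""
    rng = np.random.default_rng(seed)
    x0 = rng.uniform(0.0, 1.0, size=n_pixels)
    encoders = [rng.normal(size=(emb_dim, n_pixels)) for _ in range(n_models)]
    targets = [rng.normal(size=emb_dim) for _ in range(n_models)]
    return x0, encoders, targets


if __name__ == "__main__":
    x0, encoders, targets = make_toy_problem()
    x_adv = pgd_linf(x0, encoders, targets)
    print("score before:", ensemble_score(encoders, x0, targets))
    print("score after: ", ensemble_score(encoders, x_adv, targets))
    print("max |delta|: ", np.max(np.abs(x_adv - x0)))
```

Averaging the gradient over all stand-in encoders is what makes this a joint attack. In a real setting the linear maps would be replaced by differentiable image encoders and the targets by the corresponding text embeddings, with gradients supplied by an autodiff framework.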
