Adversarial Confusion Attack: Disrupting Multimodal Large Language Models
November 25, 2025
Authors: Jakub Hoscilowicz, Artur Janicki
cs.AI
Abstract
We introduce the Adversarial Confusion Attack, a new class of threats against multimodal large language models (MLLMs). Unlike jailbreaks or targeted misclassification, the goal is to induce systematic disruption that makes the model generate incoherent or confidently incorrect outputs. Practical applications include embedding such adversarial images into websites to prevent MLLM-powered AI Agents from operating reliably. The proposed attack maximizes next-token entropy using a small ensemble of open-source MLLMs. In the white-box setting, we show that a single adversarial image can disrupt all models in the ensemble, both in the full-image and Adversarial CAPTCHA settings. Despite relying on a basic adversarial technique (PGD), the attack generates perturbations that transfer to both unseen open-source (e.g., Qwen3-VL) and proprietary (e.g., GPT-5.1) models.
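A minimal sketch of the kind of optimization the abstract describes: an L_inf PGD loop that maximizes the mean next-token entropy over a white-box ensemble of MLLMs. The ensemble interface, the perturbation budget, the step size, and the iteration count below are illustrative assumptions, not values reported by the authors; each ensemble member is abstracted as a callable that maps an image tensor (with a fixed prompt baked in) to next-token logits.

```python
# Hypothetical sketch of an entropy-maximizing PGD attack (assumptions noted in comments).
import torch

def next_token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of the next-token distribution, averaged over the batch."""
    log_probs = torch.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1).mean()

def adversarial_confusion_pgd(
    image: torch.Tensor,        # clean image in [0, 1], shape (1, 3, H, W)
    ensemble,                   # list of callables: image -> next-token logits (assumed wrapper)
    epsilon: float = 16 / 255,  # assumed L_inf budget
    alpha: float = 2 / 255,     # assumed PGD step size
    steps: int = 100,           # assumed number of PGD iterations
) -> torch.Tensor:
    """L_inf PGD that *maximizes* the average next-token entropy across the ensemble."""
    delta = torch.zeros_like(image).uniform_(-epsilon, epsilon).requires_grad_(True)

    for _ in range(steps):
        adv = (image + delta).clamp(0.0, 1.0)
        # Average the next-token entropy over all white-box models in the ensemble.
        loss = torch.stack([next_token_entropy(model(adv)) for model in ensemble]).mean()
        loss.backward()

        with torch.no_grad():
            # Gradient ascent on entropy, then projection back onto the L_inf ball.
            delta += alpha * delta.grad.sign()
            delta.clamp_(-epsilon, epsilon)
        delta.grad = None

    return (image + delta).detach().clamp(0.0, 1.0)
```

The same loop applies to the Adversarial CAPTCHA setting by restricting the perturbation mask to a patch of the image rather than the full frame; transfer to unseen models is then evaluated by feeding the resulting image to MLLMs outside the ensemble.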