Multi-Faceted Attack: Exposing Cross-Model Vulnerabilities in Defense-Equipped Vision-Language Models
November 20, 2025
Authors: Yijun Yang, Lichao Wang, Jianping Zhang, Chi Harold Liu, Lanqing Hong, Qiang Xu
cs.AI
Abstract
The growing misuse of Vision-Language Models (VLMs) has led providers to deploy multiple safeguards, including alignment tuning, system prompts, and content moderation. However, the real-world robustness of these defenses against adversarial attacks remains underexplored. We introduce Multi-Faceted Attack (MFA), a framework that systematically exposes general safety vulnerabilities in leading defense-equipped VLMs such as GPT-4o, Gemini-Pro, and Llama-4. The core component of MFA is the Attention-Transfer Attack (ATA), which hides harmful instructions inside a meta task with competing objectives. We provide a theoretical perspective based on reward hacking to explain why this attack succeeds. To improve cross-model transferability, we further introduce a lightweight transfer-enhancement algorithm, combined with a simple repetition strategy, that jointly bypasses input-level and output-level filters without model-specific fine-tuning. Empirically, we show that adversarial images optimized against one vision encoder transfer broadly to unseen VLMs, indicating that shared visual representations create a cross-model safety vulnerability. Overall, MFA achieves a 58.5% success rate and consistently outperforms existing methods; on state-of-the-art commercial models, MFA reaches a 52.8% success rate, surpassing the second-best attack by 34 percentage points. These results challenge the perceived robustness of current defense mechanisms and highlight persistent safety weaknesses in modern VLMs. Code: https://github.com/cure-lab/MultiFacetedAttack
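To make the transferability claim concrete, below is a minimal PGD-style sketch of optimizing an adversarial image against a single frozen vision encoder. This is not the paper's MFA/ATA implementation: the surrogate encoder (OpenAI's ViT-B/32 via open_clip), the cosine-similarity objective, the function and variable names, and all hyperparameters are illustrative assumptions. The point is only that a perturbation tuned in one encoder's embedding space can carry over to VLMs built on similar visual representations.

import torch
from torchvision.transforms import Normalize
import open_clip

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical surrogate encoder; the abstract does not specify which encoder MFA uses.
model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
model = model.to(device).eval()

# Standard CLIP preprocessing statistics, applied inside the loop so the
# perturbation itself lives in raw [0, 1] pixel space.
clip_norm = Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                      std=(0.26862954, 0.26130258, 0.27577711))

def pgd_toward_target(image, target_emb, eps=8 / 255, alpha=1 / 255, steps=100):
    """PGD that nudges `image` (1x3x224x224, values in [0, 1]) so its embedding
    moves toward an attacker-chosen `target_emb` (1xD, L2-normalized)."""
    image = image.to(device)
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        emb = model.encode_image(clip_norm(adv))
        emb = emb / emb.norm(dim=-1, keepdim=True)
        loss = torch.nn.functional.cosine_similarity(emb, target_emb).mean()
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            adv = adv + alpha * grad.sign()               # ascend on similarity
            adv = image + (adv - image).clamp(-eps, eps)  # stay in the eps-ball
            adv = adv.clamp(0.0, 1.0).detach()            # stay a valid image
    return adv

In practice, one would then evaluate the returned image against held-out VLMs to measure transfer; the sketch omits that evaluation loop.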