

A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1

March 13, 2025
Authors: Zhaoyi Li, Xiaohan Zhao, Dong-Dong Wu, Jiacheng Cui, Zhiqiang Shen
cs.AI

Abstract

Despite promising performance on open-source large vision-language models (LVLMs), transfer-based targeted attacks often fail against black-box commercial LVLMs. Analyzing failed adversarial perturbations reveals that the learned perturbations typically originate from a uniform distribution and lack clear semantic details, resulting in unintended responses. This critical absence of semantic information leads commercial LVLMs to either ignore the perturbation entirely or misinterpret its embedded semantics, thereby causing the attack to fail. To overcome these issues, we notice that identifying core semantic objects is a key objective for models trained with various datasets and methodologies. This insight motivates our approach that refines semantic clarity by encoding explicit semantic details within local regions, thus ensuring interoperability and capturing finer-grained features, and by concentrating modifications on semantically rich areas rather than applying them uniformly. To achieve this, we propose a simple yet highly effective solution: at each optimization step, the adversarial image is cropped randomly by a controlled aspect ratio and scale, resized, and then aligned with the target image in the embedding space. Experimental results confirm our hypothesis. Our adversarial examples crafted with local-aggregated perturbations focused on crucial regions exhibit surprisingly good transferability to commercial LVLMs, including GPT-4.5, GPT-4o, Gemini-2.0-flash, Claude-3.5-sonnet, Claude-3.7-sonnet, and even reasoning models like o1, Claude-3.7-thinking and Gemini-2.0-flash-thinking. Our approach achieves success rates exceeding 90% on GPT-4.5, 4o, and o1, significantly outperforming all prior state-of-the-art attack methods. Our optimized adversarial examples under different configurations and training code are available at https://github.com/VILA-Lab/M-Attack.
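The core procedure described in the abstract, randomly cropping the adversarial image with a controlled aspect ratio and scale, resizing it, and aligning it with the target image in the embedding space, can be illustrated with a short sketch. The code below is a minimal, hypothetical PyTorch rendering, assuming a differentiable surrogate image encoder (e.g., a CLIP-style vision backbone passed in as `encoder`), a cosine-similarity alignment loss, and illustrative step-size and budget values; it is not the authors' released implementation, which is available in the M-Attack repository linked above.

```python
# Minimal sketch of one crop-and-align optimization step (assumptions noted above).
import torch
import torch.nn.functional as F
from torchvision.transforms import RandomResizedCrop

def crop_and_align_step(clean_image, target_image, delta, encoder,
                        epsilon=16 / 255, alpha=1 / 255,
                        scale=(0.5, 1.0), ratio=(0.75, 1.33)):
    """One step: random local crop of the adversarial image, resize,
    then align its embedding with the target image's embedding.
    `encoder`, the cosine loss, and all hyperparameters are illustrative."""
    delta = delta.detach().requires_grad_(True)
    adv_image = torch.clamp(clean_image + delta, 0.0, 1.0)

    # Random crop with a controlled aspect ratio and scale, resized back
    # to the encoder's expected input resolution.
    crop = RandomResizedCrop(size=adv_image.shape[-2:], scale=scale, ratio=ratio)
    cropped = crop(adv_image)

    # Align the cropped adversarial view with the target image in embedding space.
    adv_emb = encoder(cropped)
    with torch.no_grad():
        tgt_emb = encoder(target_image)
    loss = 1.0 - F.cosine_similarity(adv_emb, tgt_emb, dim=-1).mean()
    loss.backward()

    # Signed-gradient update on the perturbation, kept inside an L_inf budget.
    with torch.no_grad():
        delta = delta - alpha * delta.grad.sign()
        delta = delta.clamp(-epsilon, epsilon)
    return delta.detach()
```

Repeating this step so that each iteration sees a different crop concentrates the learned perturbation on semantically rich local regions, which is the "local-aggregated perturbation" effect the abstract attributes to the method.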
