A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1
March 13, 2025
Authors: Zhaoyi Li, Xiaohan Zhao, Dong-Dong Wu, Jiacheng Cui, Zhiqiang Shen
cs.AI
Abstract
Despite achieving promising performance against open-source large vision-language
models (LVLMs), transfer-based targeted attacks often fail against black-box
commercial LVLMs. Analyzing failed adversarial perturbations reveals that the
learned perturbations typically originate from a uniform distribution and lack
clear semantic details, resulting in unintended responses. This critical
absence of semantic information leads commercial LVLMs to either ignore the
perturbation entirely or misinterpret its embedded semantics, thereby causing
the attack to fail. To overcome these issues, we notice that identifying core
semantic objects is a key objective for models trained with various datasets
and methodologies. This insight motivates our approach that refines semantic
clarity by encoding explicit semantic details within local regions, thus
ensuring interoperability and capturing finer-grained features, and by
concentrating modifications on semantically rich areas rather than applying
them uniformly. To achieve this, we propose a simple yet highly effective
solution: at each optimization step, the adversarial image is cropped randomly
with a controlled aspect ratio and scale, resized, and then aligned with the
target image in the embedding space (see the code sketch after the abstract).
Experimental results confirm our
hypothesis. Our adversarial examples crafted with local-aggregated
perturbations focused on crucial regions exhibit surprisingly good
transferability to commercial LVLMs, including GPT-4.5, GPT-4o,
Gemini-2.0-flash, Claude-3.5-sonnet, Claude-3.7-sonnet, and even reasoning
models like o1, Claude-3.7-thinking, and Gemini-2.0-flash-thinking. Our approach
achieves success rates exceeding 90% on GPT-4.5, 4o, and o1, significantly
outperforming all prior state-of-the-art attack methods. Our optimized
adversarial examples under different configurations and training code are
available at https://github.com/VILA-Lab/M-Attack.
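
The crop-resize-align step described in the abstract can be illustrated with a short PyTorch sketch. This is a minimal illustration under stated assumptions, not the authors' released implementation: the function name m_attack_step and the hyperparameters (eps, alpha, n_crops, scale, ratio) are hypothetical, and `encoder` stands for any differentiable surrogate image encoder (e.g., a CLIP vision tower) with frozen weights.

```python
# Minimal sketch of the crop-resize-align optimization step; names and
# hyperparameters are illustrative assumptions, not the released API.
import torch
import torch.nn.functional as F
from torchvision import transforms


def m_attack_step(encoder, clean_image, target_image, delta,
                  eps=16 / 255, alpha=1 / 255, n_crops=4,
                  scale=(0.5, 1.0), ratio=(0.75, 1.33)):
    """One step: random crops of the adversarial image are resized back to
    full resolution and pulled toward the target image in embedding space."""
    crop = transforms.RandomResizedCrop(
        size=clean_image.shape[-2:], scale=scale, ratio=ratio)

    delta = delta.detach().requires_grad_(True)
    x_adv = (clean_image + delta).clamp(0, 1)

    with torch.no_grad():
        target_emb = encoder(target_image)

    # Averaging the alignment loss over several random crops concentrates
    # semantic detail in local, semantically rich regions instead of
    # spreading the perturbation uniformly over the image.
    loss = 0.0
    for _ in range(n_crops):
        adv_emb = encoder(crop(x_adv))
        loss = loss + (1.0 - F.cosine_similarity(adv_emb, target_emb).mean())
    loss = loss / n_crops
    loss.backward()

    # PGD-style signed update under an L_inf budget (a common choice for
    # such attacks; the paper's exact optimizer and budget may differ).
    with torch.no_grad():
        delta = delta - alpha * delta.grad.sign()
        delta = delta.clamp(-eps, eps)
        delta = (clean_image + delta).clamp(0, 1) - clean_image
    return delta, loss.item()
```

In practice this step would be iterated many times, possibly over an ensemble of surrogate encoders; the exact crop range, optimizer, and perturbation budget should be taken from the repository linked above.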