A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1
March 13, 2025
Authors: Zhaoyi Li, Xiaohan Zhao, Dong-Dong Wu, Jiacheng Cui, Zhiqiang Shen
cs.AI
Abstract
Despite promising performance on open-source large vision-language models
(LVLMs), transfer-based targeted attacks often fail against black-box
commercial LVLMs. Analyzing failed adversarial perturbations reveals that the
learned perturbations typically originate from a uniform distribution and lack
clear semantic details, resulting in unintended responses. This critical
absence of semantic information leads commercial LVLMs to either ignore the
perturbation entirely or misinterpret its embedded semantics, thereby causing
the attack to fail. To overcome these issues, we notice that identifying core
semantic objects is a key objective for models trained with various datasets
and methodologies. This insight motivates our approach that refines semantic
clarity by encoding explicit semantic details within local regions, thus
ensuring interoperability and capturing finer-grained features, and by
concentrating modifications on semantically rich areas rather than applying
them uniformly. To achieve this, we propose a simple yet highly effective
solution: at each optimization step, the adversarial image is cropped randomly
by a controlled aspect ratio and scale, resized, and then aligned with the
target image in the embedding space. Experimental results confirm our
hypothesis. Our adversarial examples crafted with local-aggregated
perturbations focused on crucial regions exhibit surprisingly good
transferability to commercial LVLMs, including GPT-4.5, GPT-4o,
Gemini-2.0-flash, Claude-3.5-sonnet, Claude-3.7-sonnet, and even reasoning
models like o1, Claude-3.7-thinking and Gemini-2.0-flash-thinking. Our approach
achieves success rates exceeding 90% on GPT-4.5, 4o, and o1, significantly
outperforming all prior state-of-the-art attack methods. Our optimized
adversarial examples under different configurations and training code are
available at https://github.com/VILA-Lab/M-Attack.
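
The core optimization step described in the abstract (randomly crop the adversarial image with a controlled scale and aspect ratio, resize it, then align it with the target image in embedding space) can be sketched as follows. This is a minimal illustrative sketch, not the authors' released M-Attack implementation: the `encoder` (e.g., a CLIP-style image encoder), the step size `alpha`, the L_inf budget `epsilon`, and the crop `scale`/`ratio` ranges are all assumed placeholders chosen for illustration.

```python
# Illustrative sketch of a crop-resize-align optimization step (not the released code).
import torch
import torch.nn.functional as F
from torchvision.transforms import RandomResizedCrop

def attack_step(adv_image, clean_image, target_embed, encoder,
                epsilon=16 / 255, alpha=1 / 255,
                scale=(0.5, 1.0), ratio=(0.75, 1.33), image_size=224):
    """One optimization step: random local crop -> resize -> embedding alignment.

    All hyperparameter defaults here are assumptions for illustration only.
    """
    adv_image = adv_image.clone().detach().requires_grad_(True)

    # Randomly crop a local region with a controlled scale and aspect ratio,
    # then resize it back to the encoder's input resolution.
    crop = RandomResizedCrop(image_size, scale=scale, ratio=ratio)
    cropped = crop(adv_image)

    # Align the cropped adversarial view with the target image in embedding space
    # by maximizing cosine similarity (i.e., minimizing 1 - cosine similarity).
    adv_embed = encoder(cropped)
    loss = 1 - F.cosine_similarity(adv_embed, target_embed, dim=-1).mean()
    loss.backward()

    with torch.no_grad():
        # Signed-gradient descent step toward the target embedding.
        adv_image = adv_image - alpha * adv_image.grad.sign()
        # Project the perturbation back into the L_inf budget and valid pixel range.
        adv_image = clean_image + (adv_image - clean_image).clamp(-epsilon, epsilon)
        adv_image = adv_image.clamp(0, 1)
    return adv_image.detach()
```

Iterating this step over many randomly sampled local views accumulates perturbations aligned with the target's semantics in local regions, which is what the abstract refers to as local-aggregated perturbations focused on semantically rich areas.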