

A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1

March 13, 2025
Authors: Zhaoyi Li, Xiaohan Zhao, Dong-Dong Wu, Jiacheng Cui, Zhiqiang Shen
cs.AI

Abstract

Despite promising performance on open-source large vision-language models (LVLMs), transfer-based targeted attacks often fail against black-box commercial LVLMs. Analyzing failed adversarial perturbations reveals that the learned perturbations typically originate from a uniform distribution and lack clear semantic details, resulting in unintended responses. This critical absence of semantic information leads commercial LVLMs to either ignore the perturbation entirely or misinterpret its embedded semantics, thereby causing the attack to fail. To overcome these issues, we notice that identifying core semantic objects is a key objective for models trained with various datasets and methodologies. This insight motivates our approach that refines semantic clarity by encoding explicit semantic details within local regions, thus ensuring interoperability and capturing finer-grained features, and by concentrating modifications on semantically rich areas rather than applying them uniformly. To achieve this, we propose a simple yet highly effective solution: at each optimization step, the adversarial image is cropped randomly by a controlled aspect ratio and scale, resized, and then aligned with the target image in the embedding space. Experimental results confirm our hypothesis. Our adversarial examples crafted with local-aggregated perturbations focused on crucial regions exhibit surprisingly good transferability to commercial LVLMs, including GPT-4.5, GPT-4o, Gemini-2.0-flash, Claude-3.5-sonnet, Claude-3.7-sonnet, and even reasoning models like o1, Claude-3.7-thinking and Gemini-2.0-flash-thinking. Our approach achieves success rates exceeding 90% on GPT-4.5, 4o, and o1, significantly outperforming all prior state-of-the-art attack methods. Our optimized adversarial examples under different configurations and training code are available at https://github.com/VILA-Lab/M-Attack.
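The core optimization step described above (random crop with controlled scale and aspect ratio, resize, then embedding-space alignment with the target image) can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' released implementation: it assumes a CLIP-style surrogate image encoder, an L_inf perturbation budget, and a signed-gradient update; the names `attack_step`, `encoder`, `target_emb`, and the crop/step hyperparameters are hypothetical. Refer to the official code at https://github.com/VILA-Lab/M-Attack for the actual method.

```python
import torch
import torch.nn.functional as F
from torchvision.transforms import RandomResizedCrop

def attack_step(adv_image, orig_image, target_emb, encoder,
                epsilon=16 / 255, lr=1 / 255):
    """One optimization step (illustrative sketch, not the official M-Attack code):
    randomly crop a local region of the adversarial image at a controlled scale and
    aspect ratio, resize it back, and pull its embedding toward the target image's
    embedding, so perturbations concentrate semantic detail in local regions."""
    adv_image = adv_image.clone().detach().requires_grad_(True)

    # Random local crop with controlled scale/aspect ratio, resized to full resolution.
    crop = RandomResizedCrop(size=adv_image.shape[-2:],
                             scale=(0.5, 1.0), ratio=(0.75, 1.33))
    cropped = crop(adv_image)

    # Align the cropped view with the target image in the encoder's embedding space.
    emb = encoder(cropped)
    loss = 1.0 - F.cosine_similarity(emb, target_emb, dim=-1).mean()
    loss.backward()

    # Signed-gradient descent on the alignment loss, then project back into the
    # L_inf epsilon-ball around the original image and the valid pixel range.
    with torch.no_grad():
        adv_image = adv_image - lr * adv_image.grad.sign()
        adv_image = orig_image + (adv_image - orig_image).clamp(-epsilon, epsilon)
        adv_image = adv_image.clamp(0.0, 1.0)
    return adv_image.detach()
```

Repeating this step aggregates perturbations over many random local views, which is what the abstract refers to as local-aggregated perturbations focused on crucial regions.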
