Adversarial Attacks against Closed-Source MLLMs via Feature Optimal Alignment
May 27, 2025
作者: Xiaojun Jia, Sensen Gao, Simeng Qin, Tianyu Pang, Chao Du, Yihao Huang, Xinfeng Li, Yiming Li, Bo Li, Yang Liu
cs.AI
Abstract
Multimodal large language models (MLLMs) remain vulnerable to transferable
adversarial examples. While existing methods typically achieve targeted attacks
by aligning global features (such as CLIP's [CLS] token) between adversarial and
target samples, they often overlook the rich local information encoded in patch
tokens. This leads to suboptimal alignment and limited transferability,
particularly for closed-source models. To address this limitation, we propose a
targeted transferable adversarial attack method based on feature optimal
alignment, called FOA-Attack, to improve adversarial transfer capability.
Specifically, at the global level, we introduce a global feature loss based on
cosine similarity to align the coarse-grained features of adversarial samples
with those of target samples. At the local level, given the rich local
representations within Transformers, we leverage clustering techniques to
extract compact local patterns, thereby alleviating local feature redundancy. We then
formulate local feature alignment between adversarial and target samples as an
optimal transport (OT) problem and propose a local clustering optimal transport
loss to refine fine-grained feature alignment. Additionally, we propose a
dynamic ensemble model weighting strategy to adaptively balance the influence
of multiple models during adversarial example generation, thereby further
improving transferability. Extensive experiments across various models
demonstrate the superiority of the proposed method, outperforming
state-of-the-art methods, especially in transferring to closed-source MLLMs.
The code is released at https://github.com/jiaxiaojunQAQ/FOA-Attack.
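The two alignment losses described above lend themselves to a compact implementation. Below is a minimal PyTorch sketch of the global cosine loss, the clustering of patch tokens into compact local patterns, and an entropic (Sinkhorn) optimal transport loss between the resulting centroid sets. The function names, k-means initialization, Sinkhorn parameters, and hard cluster assignments are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of FOA-Attack's alignment losses; see comments for assumptions.
import torch
import torch.nn.functional as F

def global_cosine_loss(adv_cls: torch.Tensor, tgt_cls: torch.Tensor) -> torch.Tensor:
    """Coarse-grained alignment: push the adversarial image's global ([CLS])
    feature toward the target image's global feature via cosine similarity."""
    return 1.0 - F.cosine_similarity(adv_cls, tgt_cls, dim=-1).mean()

def kmeans_centroids(feats: torch.Tensor, k: int, iters: int = 10) -> torch.Tensor:
    """Compact local patterns: cluster patch tokens (n, d) into k centroids.
    Assignments are computed on detached features (plain Lloyd iterations);
    final centroids are averaged from the original features so gradients flow."""
    with torch.no_grad():
        idx = torch.randperm(feats.size(0))[:k]   # init from k random tokens
        centers = feats[idx].clone()
        for _ in range(iters):
            assign = torch.cdist(feats, centers).argmin(dim=1)
            for j in range(k):
                mask = assign == j
                if mask.any():
                    centers[j] = feats[mask].mean(dim=0)
        assign = torch.cdist(feats, centers).argmin(dim=1)
    out = []
    for j in range(k):
        mask = assign == j
        out.append(feats[mask].mean(dim=0) if mask.any() else centers[j])
    return torch.stack(out)

def sinkhorn_ot_loss(a_cent: torch.Tensor, t_cent: torch.Tensor,
                     eps: float = 0.05, iters: int = 50) -> torch.Tensor:
    """Fine-grained alignment: entropic OT between the two centroid sets
    under uniform marginals, solved with standard Sinkhorn iterations."""
    cost = 1.0 - F.normalize(a_cent, dim=-1) @ F.normalize(t_cent, dim=-1).t()
    K = torch.exp(-cost / eps)
    u = torch.full((cost.size(0),), 1.0 / cost.size(0), device=cost.device)
    v = torch.full((cost.size(1),), 1.0 / cost.size(1), device=cost.device)
    r, c = u.clone(), v.clone()
    for _ in range(iters):
        r = u / (K @ c)
        c = v / (K.t() @ r)
    plan = torch.diag(r) @ K @ torch.diag(c)  # approximate transport plan
    return (plan * cost).sum()
```

In the full attack, these losses would presumably be summed over the surrogate encoders and minimized with respect to the adversarial perturbation under the usual norm budget.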
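The dynamic ensemble weighting can likewise be sketched. The rule below, a softmax over detached per-surrogate losses so that models which remain hardest to fool dominate the update, is one plausible instantiation; the temperature and this specific update rule are assumptions, not necessarily the paper's exact strategy.

```python
# Hedged sketch of dynamic ensemble model weighting (assumed rule).
from typing import Sequence
import torch

def dynamic_ensemble_loss(per_model_losses: Sequence[torch.Tensor],
                          temperature: float = 1.0) -> torch.Tensor:
    """Combine per-surrogate alignment losses with adaptive weights:
    surrogates with larger current loss get larger weight, so the
    optimization focuses on the models that are hardest to attack."""
    losses = torch.stack(list(per_model_losses))
    weights = torch.softmax(losses.detach() / temperature, dim=0)
    return (weights * losses).sum()
```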