Adversarial Attacks against Closed-Source MLLMs via Feature Optimal Alignment

Abstract

Multimodal large language models (MLLMs) remain vulnerable to transferable adversarial examples. Existing methods typically achieve targeted attacks by aligning global features, such as CLIP's [CLS] token, between adversarial and target samples, but they often overlook the rich local information encoded in patch tokens. This leads to suboptimal alignment and limited transferability, particularly for closed-source models. To address this limitation, we propose FOA-Attack, a targeted transferable adversarial attack method based on feature optimal alignment, to improve adversarial transfer capability. Specifically, at the global level, we introduce a global feature loss based on cosine similarity to align the coarse-grained features of adversarial samples with those of target samples. At the local level, given the rich local representations within Transformers, we use clustering to extract compact local patterns, which alleviates redundancy among local features. We then formulate local feature alignment between adversarial and target samples as an optimal transport (OT) problem and propose a local clustering optimal transport loss to refine fine-grained feature alignment. Additionally, we propose a dynamic ensemble model weighting strategy that adaptively balances the influence of multiple models during adversarial example generation, further improving transferability. Extensive experiments across various models demonstrate the superiority of the proposed method, which outperforms state-of-the-art methods, especially when transferring to closed-source MLLMs. The code is released at https://github.com/jiaxiaojunQAQ/FOA-Attack.
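
To make the two alignment terms concrete, the following is a minimal, dependency-free sketch of a cosine-similarity global loss and a clustering-plus-OT local loss. It is an illustration under stated assumptions, not the released FOA-Attack implementation: the function names, plain k-means, uniform cluster weights, the cosine cost, the entropy-regularized Sinkhorn solver, and the weight `lam` are all hypothetical choices made here, since the abstract does not specify them.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

def global_loss(cls_adv, cls_tgt):
    """Global term: 1 - cosine similarity of [CLS]-style features."""
    return 1.0 - cosine(cls_adv, cls_tgt)

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means (assumed clustering method); returns k centroids."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            groups[nearest].append(p)
        for c, g in enumerate(groups):
            if g:  # keep the previous center if a cluster is empty
                centers[c] = [sum(col) / len(g) for col in zip(*g)]
    return centers

def sinkhorn_ot_loss(xs, ys, eps=0.1, iters=200):
    """Entropy-regularized OT cost between two uniformly weighted sets.

    The pairwise cost is 1 - cosine similarity (an assumption).
    """
    n, m = len(xs), len(ys)
    cost = [[1.0 - cosine(x, y) for y in ys] for x in xs]
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    a, b = 1.0 / n, 1.0 / m
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternate scaling so the plan's row and column sums match a and b.
        u = [a / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return sum(
        u[i] * kernel[i][j] * v[j] * cost[i][j]
        for i in range(n)
        for j in range(m)
    )

def foa_style_loss(cls_adv, cls_tgt, patches_adv, patches_tgt, k=2, lam=1.0):
    """Global cosine term plus lam times the clustered local OT term."""
    local = sinkhorn_ot_loss(kmeans(patches_adv, k), kmeans(patches_tgt, k))
    return global_loss(cls_adv, cls_tgt) + lam * local
```

In an actual attack, a loss of this shape would be computed on differentiable surrogate-encoder features and minimized with respect to a bounded image perturbation; the list-based version above only shows how the two terms are formed.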
