Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates
May 28, 2025
Authors: Jaewoo Ahn, Heeseung Yun, Dayoon Ko, Gunhee Kim
cs.AI
Abstract
While pre-trained multimodal representations (e.g., CLIP) have shown
impressive capabilities, they exhibit significant compositional vulnerabilities
leading to counterintuitive judgments. We introduce Multimodal Adversarial
Compositionality (MAC), a benchmark that leverages large language models (LLMs)
to generate deceptive text samples to exploit these vulnerabilities across
different modalities and evaluates them through both sample-wise attack success
rate and group-wise entropy-based diversity. To improve zero-shot methods, we
propose a self-training approach that leverages rejection-sampling fine-tuning
with diversity-promoting filtering, which enhances both attack success rate and
sample diversity. Using smaller language models like Llama-3.1-8B, our approach
demonstrates superior performance in revealing compositional vulnerabilities
across various multimodal representations, including images, videos, and
audio.
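As a concrete illustration of the sample-wise attack criterion described above, here is a minimal sketch (not the authors' released code): a deceptive caption counts as a successful attack when CLIP scores it above the original caption for the same image. The checkpoint name and the preference-based success test are assumptions for illustration.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; the paper evaluates several multimodal encoders.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def attack_succeeds(image: Image.Image, original: str, deceptive: str) -> bool:
    """Return True if CLIP prefers the deceptive caption over the original
    for the same image (one plausible reading of 'attack success')."""
    inputs = processor(text=[original, deceptive], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image[0]  # similarity to each text
    return bool(logits[1] > logits[0])
```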
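The self-training recipe (rejection-sampling fine-tuning with diversity-promoting filtering) could likewise be sketched as follows. The pooled token-entropy proxy for group-wise diversity, the candidate count `k`, and the threshold `min_entropy` are hypothetical stand-ins, not the paper's exact procedure.

```python
import math
from collections import Counter

def token_entropy(samples: list[str]) -> float:
    """Group-wise diversity proxy: Shannon entropy (bits) of the token
    distribution pooled over all generated attack samples."""
    counts = Counter(tok for s in samples for tok in s.lower().split())
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def build_self_training_set(generate, attack_succeeds, examples,
                            k=8, min_entropy=4.0):
    """Rejection sampling with a diversity filter: per input, draw k
    candidate rewrites, keep only successful attacks, and retain the
    group only if its pooled token entropy clears the threshold."""
    kept = []
    for image, original in examples:
        winners = [c for c in (generate(original) for _ in range(k))
                   if attack_succeeds(image, original, c)]
        if winners and token_entropy(winners) >= min_entropy:
            kept.extend((original, w) for w in winners)
    return kept  # (input, target) pairs for fine-tuning the LLM
```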