Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates
May 28, 2025
Authors: Jaewoo Ahn, Heeseung Yun, Dayoon Ko, Gunhee Kim
cs.AI
Abstract
While pre-trained multimodal representations (e.g., CLIP) have shown
impressive capabilities, they exhibit significant compositional
vulnerabilities that lead to counterintuitive judgments. We introduce
Multimodal Adversarial Compositionality (MAC), a benchmark that leverages
large language models (LLMs) to generate deceptive text samples that exploit
these vulnerabilities across different modalities, and evaluates them through
both sample-wise attack success rate and group-wise, entropy-based diversity.
To improve on zero-shot methods, we propose a self-training approach that
combines rejection-sampling fine-tuning with diversity-promoting filtering,
enhancing both attack success rate and sample diversity. Using smaller
language models such as Llama-3.1-8B, our approach demonstrates superior
performance in revealing compositional vulnerabilities across diverse
multimodal representations, including images, videos, and audio.
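
The benchmark scores generated samples along two axes: a sample-wise attack
success rate and a group-wise, entropy-based diversity measure. A minimal
Python sketch of both, under assumed definitions (the success criterion and
the n-gram entropy formulation are illustrative choices, not taken from the
abstract):

```python
import math
from collections import Counter

def attack_success_rate(successes):
    """Sample-wise attack success rate: the fraction of generated deceptive
    texts that fool the target model. What counts as a "success" (e.g., the
    deceptive caption out-scoring the ground-truth caption under CLIP
    similarity) is an assumption here, not specified by the abstract."""
    return sum(successes) / len(successes)

def entropy_diversity(samples, n=1):
    """Group-wise diversity as the Shannon entropy of the n-gram distribution
    over the generated set; using token n-grams is an illustrative choice and
    may differ from the paper's exact formulation."""
    grams = Counter()
    for s in samples:
        tokens = s.split()
        grams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(grams.values())
    return -sum((c / total) * math.log2(c / total) for c in grams.values())
```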
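The self-training loop can likewise be sketched at a high level; every
callable below (`generate`, `succeeds`, `is_novel`, `fine_tune`) is a
hypothetical stand-in for a component the abstract only names:

```python
def self_training_round(generate, succeeds, is_novel, fine_tune, prompts, k=8):
    """One round of rejection-sampling fine-tuning with diversity-promoting
    filtering. `generate(prompt)` samples a deceptive text from the LLM,
    `succeeds(prompt, text)` checks whether it fools the target
    representation, `is_novel(text, kept_texts)` is the diversity filter,
    and `fine_tune(pairs)` updates the LLM on the accepted pairs."""
    kept = []
    for prompt in prompts:
        for _ in range(k):
            text = generate(prompt)
            if not succeeds(prompt, text):
                continue  # rejection sampling: discard failed attacks
            if not is_novel(text, [t for _, t in kept]):
                continue  # diversity filter: discard near-duplicates
            kept.append((prompt, text))
    fine_tune(kept)  # reinforce only successful, diverse samples
    return kept
```

In this reading, rejection sampling ensures fine-tuning only reinforces
samples that actually fool the model, while the diversity filter keeps the
fine-tuning data from collapsing onto a few repeated attack patterns.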