Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates
May 28, 2025
Authors: Jaewoo Ahn, Heeseung Yun, Dayoon Ko, Gunhee Kim
cs.AI
Abstract
While pre-trained multimodal representations (e.g., CLIP) have shown
impressive capabilities, they exhibit significant compositional
vulnerabilities that lead to counterintuitive judgments. We introduce
Multimodal Adversarial Compositionality (MAC), a benchmark that leverages
large language models (LLMs) to generate deceptive text samples that exploit
these vulnerabilities across different modalities, and evaluates them through
both sample-wise attack success rate and group-wise, entropy-based diversity.
To improve on zero-shot methods, we propose a self-training approach that
combines rejection-sampling fine-tuning with diversity-promoting filtering,
enhancing both attack success rate and sample diversity. Using smaller
language models such as Llama-3.1-8B, our approach demonstrates superior
performance in revealing compositional vulnerabilities across diverse
multimodal representations, including images, videos, and audio.
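
The benchmark scores generated samples along two axes: a sample-wise attack
success rate and a group-wise, entropy-based diversity measure. A minimal
Python sketch of both, under assumed definitions (the success criterion and
the n-gram entropy formulation are illustrative choices, not taken from the
abstract):

```python
import math
from collections import Counter

def attack_success_rate(successes):
    """Sample-wise attack success rate: the fraction of generated deceptive
    texts that fool the target model. What counts as a "success" (e.g., the
    deceptive caption out-scoring the ground-truth caption under CLIP
    similarity) is an assumption here, not specified by the abstract."""
    return sum(successes) / len(successes)

def entropy_diversity(samples, n=1):
    """Group-wise diversity as the Shannon entropy of the n-gram distribution
    over the generated set; using token n-grams is an illustrative choice and
    may differ from the paper's exact formulation."""
    grams = Counter()
    for s in samples:
        tokens = s.split()
        grams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(grams.values())
    return -sum((c / total) * math.log2(c / total) for c in grams.values())
```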
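The self-training loop can likewise be sketched at a high level; every
callable below (`generate`, `succeeds`, `is_novel`, `fine_tune`) is a
hypothetical stand-in for a component the abstract only names:

```python
def self_training_round(generate, succeeds, is_novel, fine_tune, prompts, k=8):
    """One round of rejection-sampling fine-tuning with diversity-promoting
    filtering. `generate(prompt)` samples a deceptive text from the LLM,
    `succeeds(prompt, text)` checks whether it fools the target
    representation, `is_novel(text, kept_texts)` is the diversity filter,
    and `fine_tune(pairs)` updates the LLM on the accepted pairs."""
    kept = []
    for prompt in prompts:
        for _ in range(k):
            text = generate(prompt)
            if not succeeds(prompt, text):
                continue  # rejection sampling: discard failed attacks
            if not is_novel(text, [t for _, t in kept]):
                continue  # diversity filter: discard near-duplicates
            kept.append((prompt, text))
    fine_tune(kept)  # reinforce only successful, diverse samples
    return kept
```

In this reading, rejection sampling ensures fine-tuning only reinforces
samples that actually fool the model, while the diversity filter keeps the
fine-tuning data from collapsing onto a few repeated attack patterns.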