

Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates

May 28, 2025
作者: Jaewoo Ahn, Heeseung Yun, Dayoon Ko, Gunhee Kim
cs.AI

Abstract
While pre-trained multimodal representations (e.g., CLIP) have shown impressive capabilities, they exhibit significant compositional vulnerabilities leading to counterintuitive judgments. We introduce Multimodal Adversarial Compositionality (MAC), a benchmark that leverages large language models (LLMs) to generate deceptive text samples to exploit these vulnerabilities across different modalities and evaluates them through both sample-wise attack success rate and group-wise entropy-based diversity. To improve zero-shot methods, we propose a self-training approach that leverages rejection-sampling fine-tuning with diversity-promoting filtering, which enhances both attack success rate and sample diversity. Using smaller language models like Llama-3.1-8B, our approach demonstrates superior performance in revealing compositional vulnerabilities across various multimodal representations, including images, videos, and audio.
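The two evaluation criteria named above can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes per-sample attack outcomes are available as booleans, and it uses Shannon entropy over the distribution of distinct generated texts as a simple stand-in for the group-wise diversity measure.

```python
from collections import Counter
import math

def attack_success_rate(outcomes):
    """Fraction of deceptive text samples that fooled the target model.
    `outcomes` is a hypothetical list of booleans, one per sample."""
    return sum(outcomes) / len(outcomes)

def entropy_diversity(samples):
    """Shannon entropy (bits) over the empirical distribution of distinct
    samples -- an illustrative proxy for group-wise diversity, not the
    benchmark's exact metric."""
    counts = Counter(samples)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

asr = attack_success_rate([True, False, True, True])        # 0.75
div = entropy_diversity(["a cat", "a dog", "a cat"])        # ~0.918 bits
```

Under this framing, the diversity-promoting filter in the self-training loop would keep fine-tuning data that raises `entropy_diversity` without lowering `attack_success_rate`.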

