Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates
May 28, 2025
Authors: Jaewoo Ahn, Heeseung Yun, Dayoon Ko, Gunhee Kim
cs.AI
Abstract
While pre-trained multimodal representations (e.g., CLIP) have shown
impressive capabilities, they exhibit significant compositional vulnerabilities
leading to counterintuitive judgments. We introduce Multimodal Adversarial
Compositionality (MAC), a benchmark that leverages large language models (LLMs)
to generate deceptive text samples to exploit these vulnerabilities across
different modalities and evaluates them through both sample-wise attack success
rate and group-wise entropy-based diversity. To improve zero-shot methods, we
propose a self-training approach that leverages rejection-sampling fine-tuning
with diversity-promoting filtering, which enhances both attack success rate and
sample diversity. Using smaller language models like Llama-3.1-8B, our approach
demonstrates superior performance in revealing compositional vulnerabilities
across various multimodal representations, including images, videos, and
audio.
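As a concrete illustration of the sample-wise attack criterion described above, here is a minimal sketch (not the authors' released code): a deceptive caption counts as a successful attack when CLIP scores it above the original caption for the same image. The checkpoint name and the preference-based success test are assumptions for illustration.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; the paper evaluates several multimodal encoders.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def attack_succeeds(image: Image.Image, original: str, deceptive: str) -> bool:
    """Return True if CLIP prefers the deceptive caption over the original
    for the same image (one plausible reading of 'attack success')."""
    inputs = processor(text=[original, deceptive], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image[0]  # similarity to each text
    return bool(logits[1] > logits[0])
```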
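The self-training recipe (rejection-sampling fine-tuning with diversity-promoting filtering) could likewise be sketched as follows. The pooled token-entropy proxy for group-wise diversity, the candidate count `k`, and the threshold `min_entropy` are hypothetical stand-ins, not the paper's exact procedure.

```python
import math
from collections import Counter

def token_entropy(samples: list[str]) -> float:
    """Group-wise diversity proxy: Shannon entropy (bits) of the token
    distribution pooled over all generated attack samples."""
    counts = Counter(tok for s in samples for tok in s.lower().split())
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def build_self_training_set(generate, attack_succeeds, examples,
                            k=8, min_entropy=4.0):
    """Rejection sampling with a diversity filter: per input, draw k
    candidate rewrites, keep only successful attacks, and retain the
    group only if its pooled token entropy clears the threshold."""
    kept = []
    for image, original in examples:
        winners = [c for c in (generate(original) for _ in range(k))
                   if attack_succeeds(image, original, c)]
        if winners and token_entropy(winners) >= min_entropy:
            kept.extend((original, w) for w in winners)
    return kept  # (input, target) pairs for fine-tuning the LLM
```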