

Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates

May 28, 2025
Authors: Jaewoo Ahn, Heeseung Yun, Dayoon Ko, Gunhee Kim
cs.AI

Abstract

While pre-trained multimodal representations (e.g., CLIP) have shown impressive capabilities, they exhibit significant compositional vulnerabilities leading to counterintuitive judgments. We introduce Multimodal Adversarial Compositionality (MAC), a benchmark that leverages large language models (LLMs) to generate deceptive text samples to exploit these vulnerabilities across different modalities and evaluates them through both sample-wise attack success rate and group-wise entropy-based diversity. To improve zero-shot methods, we propose a self-training approach that leverages rejection-sampling fine-tuning with diversity-promoting filtering, which enhances both attack success rate and sample diversity. Using smaller language models like Llama-3.1-8B, our approach demonstrates superior performance in revealing compositional vulnerabilities across various multimodal representations, including images, videos, and audio.
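To give a concrete sense of the group-wise entropy-based diversity mentioned above, here is a minimal sketch that scores a group of generated attack texts by the Shannon entropy of their pooled unigram distribution. This is an illustration only; the function name and the unigram-level formulation are assumptions, and the MAC benchmark's exact diversity metric may differ.

```python
import math
from collections import Counter

def entropy_diversity(samples):
    """Shannon entropy (in bits) of the unigram token distribution
    pooled across a group of generated text samples.

    Higher entropy suggests the group covers a wider vocabulary,
    i.e., the attack samples are more diverse. Illustrative sketch;
    not the paper's exact metric.
    """
    tokens = [tok for s in samples for tok in s.lower().split()]
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A group of near-identical samples scores lower than a varied group.
repetitive = ["a red cube on a table", "a red cube on a table"]
varied = ["a red cube on a table", "a blue sphere under a chair"]
assert entropy_diversity(varied) > entropy_diversity(repetitive)
```

A diversity-promoting filter of the kind described in the abstract could, for instance, keep only candidate groups whose entropy exceeds a threshold before rejection-sampling fine-tuning.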
