Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates

Abstract

Pre-trained multimodal representations (e.g., CLIP) show impressive capabilities, yet they exhibit significant compositional vulnerabilities that lead to counterintuitive judgments. We introduce Multimodal Adversarial Compositionality (MAC), a benchmark that uses large language models (LLMs) to generate deceptive text samples that exploit these vulnerabilities across modalities, and that evaluates the samples by both sample-wise attack success rate and group-wise entropy-based diversity. To improve on zero-shot methods, we propose a self-training approach based on rejection-sampling fine-tuning with diversity-promoting filtering, which improves both attack success rate and sample diversity. Using smaller language models such as Llama-3.1-8B, our approach demonstrates superior performance in revealing compositional vulnerabilities across multimodal representations for images, video, and audio.
