Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates

Abstract

While pre-trained multimodal representations (e.g., CLIP) have shown impressive capabilities, they exhibit significant compositional vulnerabilities that lead to counterintuitive judgments. We introduce Multimodal Adversarial Compositionality (MAC), a benchmark that leverages large language models (LLMs) to generate deceptive text samples that exploit these vulnerabilities across different modalities, and that evaluates the samples through both sample-wise attack success rate and group-wise entropy-based diversity. To improve on zero-shot methods, we propose a self-training approach that combines rejection-sampling fine-tuning with diversity-promoting filtering, which enhances both attack success rate and sample diversity. Using smaller language models such as Llama-3.1-8B, our approach demonstrates superior performance in revealing compositional vulnerabilities across various multimodal representations, including images, videos, and audio.
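
The two evaluation axes named in the abstract can be illustrated with a minimal sketch. The abstract does not say what the entropy is computed over, so this example assumes, purely for illustration, that each generated sample carries a hypothetical categorical label (for instance, the type of text edit used) and that a boolean flag records whether the sample fooled the target model. The function names and inputs are assumptions, not the benchmark's actual interface.

```python
import math
from collections import Counter


def attack_success_rate(successes):
    """Sample-wise metric: fraction of samples flagged as successful attacks.

    `successes` is a hypothetical list of booleans, one per generated sample.
    """
    if not successes:
        return 0.0
    return sum(1 for s in successes if s) / len(successes)


def entropy_diversity(labels):
    """Group-wise metric: Shannon entropy (in bits) of the label distribution.

    `labels` is a hypothetical list of categorical tags, one per sample.
    Higher entropy means the group of samples is spread over more categories.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Sum p * log2(1/p) over the observed categories.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


if __name__ == "__main__":
    flags = [True, False, True, True]
    edit_types = ["swap", "swap", "replace", "add"]
    print(attack_success_rate(flags))
    print(entropy_diversity(edit_types))
```

Under these assumptions, a generator that always succeeds with the same kind of edit would score a high success rate but zero entropy, which is the trade-off the diversity-promoting filtering is described as addressing.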
