Robustness in Both Domains: CLIP Needs a Robust Text Encoder

Abstract

Adversarial input attacks can cause a significant shift in CLIP embeddings. This can affect the downstream robustness of models that incorporate CLIP in their pipeline, such as text-to-image generative models or large vision-language models. While some efforts have been made toward making CLIP image encoders robust, the robustness of text encoders remains unexplored. In this work, we fill this gap in the literature. We propose LEAF, an efficient adversarial finetuning method for the text domain that can scale to large CLIP models. Our models significantly improve zero-shot adversarial accuracy in the text domain while maintaining the vision performance provided by robust image encoders. When combined with text-to-image diffusion models, our text encoders can improve generation quality under adversarial noise. When employed in multimodal retrieval tasks, our robust CLIP encoders improve recall under adversarial noise over standard CLIP models. Finally, we show that robust text encoders facilitate better reconstruction of input text from its embedding via direct optimization.
