Robustness in Both Domains: CLIP Needs a Robust Text Encoder

Abstract

Adversarial input attacks can cause a significant shift in CLIP embeddings. This can affect the downstream robustness of models that incorporate CLIP in their pipeline, such as text-to-image generative models and large vision-language models. While some effort has gone into making CLIP image encoders robust, the robustness of text encoders remains unexplored. In this work, we fill this gap in the literature. We propose LEAF, an efficient adversarial finetuning method for the text domain that scales to large CLIP models. Our models significantly improve zero-shot adversarial accuracy in the text domain while maintaining the vision performance provided by robust image encoders. When combined with text-to-image diffusion models, we can improve generation quality under adversarial noise. When our robust CLIP encoders are employed in multimodal retrieval tasks, we improve recall under adversarial noise over standard CLIP models. Finally, we show that robust text encoders facilitate better reconstruction of input text from its embedding via direct optimization.
