Robustness in Both Domains: CLIP Needs a Robust Text Encoder
June 3, 2025
作者: Elias Abad Rocamora, Christian Schlarmann, Naman Deep Singh, Yongtao Wu, Matthias Hein, Volkan Cevher
cs.AI
Abstract
Adversarial input attacks can cause a significant shift of CLIP embeddings.
This can affect the downstream robustness of models incorporating CLIP in the
pipeline, such as text-to-image generative models or large vision language
models. While some efforts have been made toward making CLIP image
encoders robust, the robustness of text encoders remains unexplored. In this
work, we close this gap in the literature. We propose LEAF: an efficient
adversarial finetuning method for the text domain, with the ability to scale to
large CLIP models. Our models significantly improve the zero-shot adversarial
accuracy in the text domain, while maintaining the vision performance provided
by robust image encoders. When combined with text-to-image diffusion models, we
can improve the generation quality under adversarial noise. When employing our
robust CLIP encoders in multimodal retrieval tasks, we improve the recall under
adversarial noise over standard CLIP models. Finally, we show that robust text
encoders facilitate better reconstruction of input text from its embedding via
direct optimization.
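To make the threat model concrete, the sketch below shows a character-level adversarial input attack in miniature: a greedy search over single-character substitutions that maximizes the shift of a text embedding away from the clean one. This is a toy illustration, not the paper's attack or LEAF itself; the "encoder" is a hypothetical bag-of-character-bigrams model standing in for CLIP's text encoder, and cosine distance stands in for the embedding shift.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a text encoder: sparse character-bigram counts."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def best_char_attack(text, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Greedy one-character substitution that minimizes cosine similarity
    to the clean embedding, i.e. maximizes the embedding shift."""
    clean = embed(text)
    best, best_sim = text, 1.0
    for i in range(len(text)):
        for c in alphabet:
            if c == text[i]:
                continue
            cand = text[:i] + c + text[i + 1:]
            sim = cosine(clean, embed(cand))
            if sim < best_sim:
                best, best_sim = cand, sim
    return best, best_sim

adv, sim = best_char_attack("a photo of a cat")
print(adv, round(sim, 3))
```

A robust text encoder, in this picture, is one for which no small character-level edit produces a large embedding shift; adversarial finetuning trains the encoder so that perturbed inputs like `adv` stay close to the clean embedding.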