Robustness in Both Domains: CLIP Needs a Robust Text Encoder
June 3, 2025
作者: Elias Abad Rocamora, Christian Schlarmann, Naman Deep Singh, Yongtao Wu, Matthias Hein, Volkan Cevher
cs.AI
Abstract
Adversarial input attacks can cause a significant shift of CLIP embeddings.
This can affect the downstream robustness of models incorporating CLIP in the
pipeline, such as text-to-image generative models or large vision language
models. While some efforts have been made toward making CLIP image
encoders robust, the robustness of text encoders remains unexplored. In this
work, we close this gap in the literature. We propose LEAF: an efficient
adversarial finetuning method for the text domain, with the ability to scale to
large CLIP models. Our models significantly improve the zero-shot adversarial
accuracy in the text domain, while maintaining the vision performance provided
by robust image encoders. When combined with text-to-image diffusion models, we
can improve the generation quality under adversarial noise. When employing our
robust CLIP encoders in multimodal retrieval tasks, we improve the recall under
adversarial noise over standard CLIP models. Finally, we show that robust text
encoders facilitate better reconstruction of input text from its embedding via
direct optimization.
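To make the threat model concrete, the sketch below shows a character-level adversarial input attack in miniature: a greedy search over single-character substitutions that maximizes the shift of a text embedding away from the clean one. This is a toy illustration, not the paper's attack or LEAF itself; the "encoder" is a hypothetical bag-of-character-bigrams model standing in for CLIP's text encoder, and cosine distance stands in for the embedding shift.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a text encoder: sparse character-bigram counts."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def best_char_attack(text, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Greedy one-character substitution that minimizes cosine similarity
    to the clean embedding, i.e. maximizes the embedding shift."""
    clean = embed(text)
    best, best_sim = text, 1.0
    for i in range(len(text)):
        for c in alphabet:
            if c == text[i]:
                continue
            cand = text[:i] + c + text[i + 1:]
            sim = cosine(clean, embed(cand))
            if sim < best_sim:
                best, best_sim = cand, sim
    return best, best_sim

adv, sim = best_char_attack("a photo of a cat")
print(adv, round(sim, 3))
```

A robust text encoder, in this picture, is one for which no small character-level edit produces a large embedding shift; adversarial finetuning trains the encoder so that perturbed inputs like `adv` stay close to the clean embedding.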