Robustness in Both Domains: CLIP Needs a Robust Text Encoder
June 3, 2025
Authors: Elias Abad Rocamora, Christian Schlarmann, Naman Deep Singh, Yongtao Wu, Matthias Hein, Volkan Cevher
cs.AI
Abstract
Adversarial input attacks can cause a significant shift in CLIP embeddings. This can affect the downstream robustness of models that incorporate CLIP in their pipeline, such as text-to-image generative models or large vision-language models. While some effort has been made toward making CLIP image encoders robust, the robustness of text encoders remains unexplored. In this work, we close this gap in the literature. We propose LEAF: an efficient adversarial fine-tuning method for the text domain that scales to large CLIP models. Our models significantly improve zero-shot adversarial accuracy in the text domain while maintaining the vision performance provided by robust image encoders. When combined with text-to-image diffusion models, we can improve generation quality under adversarial noise. When our robust CLIP encoders are employed in multimodal retrieval tasks, they improve recall under adversarial noise over standard CLIP models. Finally, we show that robust text encoders facilitate better reconstruction of the input text from its embedding via direct optimization.
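
As a minimal illustration of the embedding shift described above, the sketch below compares a caption's CLIP text embedding with that of a typo-perturbed copy. It assumes the Hugging Face transformers CLIP implementation and the openai/clip-vit-base-patch32 checkpoint; the character-level edits are a toy stand-in for a real adversarial attack, not the attack studied in the paper.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

# Hypothetical checkpoint choice; any CLIP text encoder would do.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed(text: str) -> torch.Tensor:
    """Return the unit-normalized CLIP text embedding of `text`."""
    inputs = tokenizer(text, return_tensors="pt")
    feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

clean = "a photo of a dog playing in the park"
# Toy character-level perturbation standing in for an adversarial attack.
perturbed = "a phot0 of a d0g playing in the psrk"

cos = (embed(clean) @ embed(perturbed).T).item()
print(f"cosine similarity, clean vs. perturbed: {cos:.3f}")
# A non-robust text encoder can show a noticeably reduced similarity here,
# i.e., a large embedding shift caused by a few character edits.
```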
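For intuition on what adversarial fine-tuning in the text domain can look like, the following sketch trains the text encoder so that embeddings of perturbed captions stay close to their clean anchors. This is a generic recipe under assumed components (the perturb function, the cosine loss, and the hyperparameters are illustrative), not the paper's LEAF procedure, whose details are not given in the abstract.

```python
import random
import string
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
# Update only the text encoder and text projection; the vision tower is frozen.
params = list(model.text_model.parameters()) + list(model.text_projection.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-5)

def perturb(text: str, n_edits: int = 2) -> str:
    """Random character substitutions as a cheap stand-in for an adversary."""
    chars = list(text)
    for _ in range(n_edits):
        i = random.randrange(len(chars))
        chars[i] = random.choice(string.ascii_lowercase)
    return "".join(chars)

def text_features(texts: list[str]) -> torch.Tensor:
    inputs = tokenizer(texts, return_tensors="pt", padding=True)
    feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

captions = ["a photo of a cat", "a painting of a mountain at sunset"]
for step in range(3):  # tiny demo loop; real training iterates over a dataset
    with torch.no_grad():
        clean = text_features(captions)              # frozen clean anchors
    adv = text_features([perturb(c) for c in captions])
    loss = (1.0 - (clean * adv).sum(dim=-1)).mean()  # pull perturbed toward clean
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"step {step}: loss = {loss.item():.4f}")
```

The design choice illustrated here is that robustness is trained into the text encoder alone, so a robust image encoder's vision performance is left untouched, matching the division of labor described in the abstract.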