Robustness in Both Domains: CLIP Needs a Robust Text Encoder
June 3, 2025
Authors: Elias Abad Rocamora, Christian Schlarmann, Naman Deep Singh, Yongtao Wu, Matthias Hein, Volkan Cevher
cs.AI
Abstract
Adversarial input attacks can cause a significant shift in CLIP embeddings. This can affect the downstream robustness of models that incorporate CLIP in their pipeline, such as text-to-image generative models or large vision-language models. While some effort has been made toward making CLIP image encoders robust, the robustness of text encoders remains unexplored. In this work, we close this gap in the literature. We propose LEAF: an efficient adversarial fine-tuning method for the text domain that scales to large CLIP models. Our models significantly improve zero-shot adversarial accuracy in the text domain while maintaining the vision performance provided by robust image encoders. When combined with text-to-image diffusion models, we can improve generation quality under adversarial noise. When our robust CLIP encoders are employed in multimodal retrieval tasks, they improve recall under adversarial noise over standard CLIP models. Finally, we show that robust text encoders facilitate better reconstruction of the input text from its embedding via direct optimization.
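
As a minimal illustration of the embedding shift described above, the sketch below compares a caption's CLIP text embedding with that of a typo-perturbed copy. It assumes the Hugging Face transformers CLIP implementation and the openai/clip-vit-base-patch32 checkpoint; the character-level edits are a toy stand-in for a real adversarial attack, not the attack studied in the paper.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

# Hypothetical checkpoint choice; any CLIP text encoder would do.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed(text: str) -> torch.Tensor:
    """Return the unit-normalized CLIP text embedding of `text`."""
    inputs = tokenizer(text, return_tensors="pt")
    feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

clean = "a photo of a dog playing in the park"
# Toy character-level perturbation standing in for an adversarial attack.
perturbed = "a phot0 of a d0g playing in the psrk"

cos = (embed(clean) @ embed(perturbed).T).item()
print(f"cosine similarity, clean vs. perturbed: {cos:.3f}")
# A non-robust text encoder can show a noticeably reduced similarity here,
# i.e., a large embedding shift caused by a few character edits.
```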
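For intuition on what adversarial fine-tuning in the text domain can look like, the following sketch trains the text encoder so that embeddings of perturbed captions stay close to their clean anchors. This is a generic recipe under assumed components (the perturb function, the cosine loss, and the hyperparameters are illustrative), not the paper's LEAF procedure, whose details are not given in the abstract.

```python
import random
import string
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
# Update only the text encoder and text projection; the vision tower is frozen.
params = list(model.text_model.parameters()) + list(model.text_projection.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-5)

def perturb(text: str, n_edits: int = 2) -> str:
    """Random character substitutions as a cheap stand-in for an adversary."""
    chars = list(text)
    for _ in range(n_edits):
        i = random.randrange(len(chars))
        chars[i] = random.choice(string.ascii_lowercase)
    return "".join(chars)

def text_features(texts: list[str]) -> torch.Tensor:
    inputs = tokenizer(texts, return_tensors="pt", padding=True)
    feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

captions = ["a photo of a cat", "a painting of a mountain at sunset"]
for step in range(3):  # tiny demo loop; real training iterates over a dataset
    with torch.no_grad():
        clean = text_features(captions)              # frozen clean anchors
    adv = text_features([perturb(c) for c in captions])
    loss = (1.0 - (clean * adv).sum(dim=-1)).mean()  # pull perturbed toward clean
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"step {step}: loss = {loss.item():.4f}")
```

The design choice illustrated here is that robustness is trained into the text encoder alone, so a robust image encoder's vision performance is left untouched, matching the division of labor described in the abstract.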