
ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder

October 21, 2025
Authors: Xiaoxing Hu, Kaicheng Yang, Ziyong Feng, Qi Ming, Zonghao Guo, Xiang An, Junchi Yan, Xue Yang
cs.AI

Abstract

The original CLIP text encoder is limited by a maximum input length of 77 tokens, which hampers its ability to process long texts and perform fine-grained semantic understanding. In addition, the CLIP text encoder lacks support for multilingual inputs. These limitations significantly restrict its applicability across a broader range of tasks. Recent studies have attempted to replace the CLIP text encoder with an LLM-based embedder to enhance its capabilities in long-text processing, multilingual understanding, and fine-grained semantic comprehension. However, because the representation space of the LLM and the vision-language space of CLIP are pretrained independently, without any alignment prior, direct alignment via contrastive learning can disrupt the intrinsic vision-language alignment within the CLIP image encoder, leaving the knowledge acquired during pre-training underutilized. To address this challenge, we propose ProCLIP, a curriculum-learning-based progressive vision-language alignment framework that effectively aligns the CLIP image encoder with an LLM-based embedder. Specifically, ProCLIP first distills knowledge from CLIP's text encoder into the LLM-based embedder to leverage CLIP's rich pretrained knowledge while establishing an initial alignment between the LLM embedder and the CLIP image encoder. Subsequently, ProCLIP further aligns the CLIP image encoder with the LLM-based embedder through image-text contrastive tuning, employing self-distillation regularization to avoid overfitting. To achieve more effective alignment, instance semantic alignment loss and embedding structure alignment loss are employed during representation inheritance and contrastive tuning. The code is available at https://github.com/VisionXLab/ProCLIP.
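
The abstract outlines a two-stage recipe: representation inheritance (distilling CLIP's text encoder into the LLM-based embedder) followed by image-text contrastive tuning with self-distillation regularization. The sketch below is a minimal, PyTorch-style illustration of what such losses could look like; all function names, tensor names, and exact loss forms here are assumptions for illustration, not the authors' actual implementation (see the repository for that).

```python
import torch
import torch.nn.functional as F

def instance_semantic_alignment(student_emb, teacher_emb):
    """Pull each LLM-embedder output toward the corresponding CLIP text embedding.
    A simple cosine-distance form; the paper's exact formulation may differ."""
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    return (1.0 - (s * t).sum(dim=-1)).mean()

def embedding_structure_alignment(student_emb, teacher_emb):
    """Match the pairwise similarity structure of the student batch to the
    teacher's, so relational knowledge (not just per-instance vectors) is inherited."""
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    return F.mse_loss(s @ s.t(), t @ t.t())

def clip_contrastive_loss(image_emb, text_emb, logit_scale):
    """Standard symmetric image-text InfoNCE loss used for contrastive tuning."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = logit_scale * image_emb @ text_emb.t()
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def self_distillation_regularizer(current_emb, frozen_emb):
    """One plausible form of self-distillation regularization: keep the tuned
    features close to those produced by a frozen copy of the pretrained encoder,
    mitigating overfitting during contrastive tuning."""
    return F.mse_loss(F.normalize(current_emb, dim=-1), F.normalize(frozen_emb, dim=-1))
```

Under this reading, stage one would apply the two alignment losses to text embeddings (LLM embedder as student, CLIP text encoder as teacher), and stage two would add the contrastive and self-distillation terms while aligning the CLIP image encoder with the LLM-based embedder; the actual loss weighting and training schedule are specified in the paper and repository, not here.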