TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment

April 13, 2026
Authors: Bingyi Cao, Koert Chen, Kevis-Kokitsi Maninis, Kaifeng Chen, Arjun Karpur, Ye Xia, Sahil Dua, Tanmaya Dabral, Guangxing Han, Bohyung Han, Joshua Ainslie, Alex Bewley, Mithun Jacob, René Wagner, Washington Ramos, Krzysztof Choromanski, Mojtaba Seyedhosseini, Howard Zhou, André Araujo
cs.AI

Abstract

Recent progress in vision-language pretraining has enabled significant improvements to many downstream computer vision applications, such as classification, retrieval, segmentation and depth prediction. However, a fundamental capability that these models still struggle with is aligning dense patch representations with text embeddings of corresponding concepts. In this work, we investigate this critical issue and propose novel techniques to enhance this capability in foundational vision-language models. First, we reveal that a patch-level distillation procedure significantly boosts dense patch-text alignment. Surprisingly, the patch-text alignment of the distilled student model strongly surpasses that of the teacher model. This observation inspires us to consider modifications to pretraining recipes, leading us to propose iBOT++, an upgrade to the commonly used iBOT masked image objective, where unmasked tokens also contribute directly to the loss. This dramatically enhances patch-text alignment of pretrained models. Additionally, to improve vision-language pretraining efficiency and effectiveness, we modify the exponential moving average setup in the learning recipe, and introduce a caption sampling strategy to benefit from synthetic captions at different granularities. Combining these components, we develop TIPSv2, a new family of image-text encoder models suitable for a wide range of downstream applications. Through comprehensive experiments on 9 tasks and 20 datasets, we demonstrate strong performance, generally on par with or better than recent vision encoder models. Code and models are released via our project page: https://gdm-tipsv2.github.io/
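
The abstract describes iBOT++ only as an objective in which unmasked tokens also contribute to the loss. The following is a minimal illustrative sketch of that one idea, not the authors' implementation: it assumes a DINO/iBOT-style cross-entropy between teacher and student patch distributions, and every name in it (`patch_cross_entropy`, `patch_loss`, `include_unmasked`), the temperature values, and the plain averaging over tokens are assumptions made here for illustration.

```python
import math


def softmax(logits, temperature):
    """Numerically stable softmax over one patch's prototype logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def patch_cross_entropy(teacher_logits, student_logits,
                        teacher_temp=0.04, student_temp=0.1):
    """Cross-entropy from the teacher distribution to the student distribution.

    The temperatures are illustrative assumptions, not values from the paper.
    """
    p_teacher = softmax(teacher_logits, teacher_temp)
    p_student = softmax(student_logits, student_temp)
    return -sum(t * math.log(s + 1e-12) for t, s in zip(p_teacher, p_student))


def patch_loss(teacher, student, mask, include_unmasked):
    """Average the per-patch loss over the selected patch tokens.

    include_unmasked=False: only masked patches count (masked-only objective).
    include_unmasked=True:  every patch counts (the idea attributed to iBOT++;
                            the uniform weighting is an assumption).
    """
    losses = [patch_cross_entropy(t, s) for t, s in zip(teacher, student)]
    if include_unmasked:
        selected = losses
    else:
        selected = [loss for loss, is_masked in zip(losses, mask) if is_masked]
    return sum(selected) / max(len(selected), 1)


# Toy example: two patches, three prototypes each (made-up numbers).
teacher = [[2.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
student = [[2.0, 0.0, -1.0], [1.0, 0.0, 0.0]]  # second patch disagrees
mask = [True, False]                           # only the first patch is masked

masked_only = patch_loss(teacher, student, mask, include_unmasked=False)
all_tokens = patch_loss(teacher, student, mask, include_unmasked=True)
print(masked_only, all_tokens)
```

In this toy case the student disagrees with the teacher only on the unmasked patch, so the masked-only loss stays near zero while the all-token variant penalizes the mismatch. How the actual method weights masked versus unmasked tokens is not stated in the abstract.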