
Region-based Cluster Discrimination for Visual Representation Learning

July 26, 2025
Authors: Yin Xie, Kaicheng Yang, Xiang An, Kun Wu, Yongle Zhao, Weimo Deng, Zimin Ran, Yumeng Wang, Ziyong Feng, Roy Miles, Ismail Elezi, Jiankang Deng
cs.AI

Abstract

Learning visual representations is foundational for a broad spectrum of downstream tasks. Although recent vision-language contrastive models, such as CLIP and SigLIP, have achieved impressive zero-shot performance via large-scale vision-language alignment, their reliance on global representations constrains their effectiveness for dense prediction tasks, such as grounding, OCR, and segmentation. To address this gap, we introduce Region-Aware Cluster Discrimination (RICE), a novel method that enhances region-level visual and OCR capabilities. We first construct a billion-scale candidate region dataset and propose a Region Transformer layer to extract rich regional semantics. We further design a unified region cluster discrimination loss that jointly supports object and OCR learning within a single classification framework, enabling efficient and scalable distributed training on large-scale data. Extensive experiments show that RICE consistently outperforms previous methods on tasks, including segmentation, dense detection, and visual perception for Multimodal Large Language Models (MLLMs). The pre-trained models have been released at https://github.com/deepglint/MVT.
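The abstract describes a unified region cluster discrimination loss that casts both object and OCR learning as classification against a large set of cluster centers. The paper does not give implementation details here, so the following is only a minimal illustrative sketch of a generic cluster-discrimination loss (cosine-similarity logits against cluster centers, followed by cross-entropy); the function name, shapes, and temperature value are assumptions, not the authors' actual RICE implementation.

```python
import numpy as np

def region_cluster_discrimination_loss(region_feats, cluster_centers, labels,
                                       temperature=0.07):
    """Sketch: classify each region embedding against a bank of cluster
    centers, so object regions and OCR regions can share one
    classification framework. All names/values here are illustrative."""
    # L2-normalize features and centers so logits are cosine similarities
    f = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    c = cluster_centers / np.linalg.norm(cluster_centers, axis=1, keepdims=True)
    logits = (f @ c.T) / temperature            # (num_regions, num_clusters)
    # numerically stable log-softmax cross-entropy over cluster ids
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# toy usage with random data (real training would use billions of regions)
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 64))            # 4 candidate regions, 64-d
centers = rng.standard_normal((1000, 64))       # 1000 cluster centers (toy scale)
labels = rng.integers(0, 1000, size=4)          # assigned cluster ids
loss = region_cluster_discrimination_loss(feats, centers, labels)
```

Framing the loss as plain classification (rather than pairwise contrastive comparison) is what makes distributed training scale: the cluster-center matrix can be sharded across devices, as in prior large-scale margin-softmax training.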