Region-based Cluster Discrimination for Visual Representation Learning

Abstract

Learning visual representations is foundational for a broad spectrum of downstream tasks. Recent vision-language contrastive models such as CLIP and SigLIP have achieved impressive zero-shot performance through large-scale vision-language alignment, but their reliance on global representations limits their effectiveness on dense prediction tasks such as grounding, OCR, and segmentation. To address this gap, we introduce Region-Aware Cluster Discrimination (RICE), a novel method that enhances region-level visual and OCR capabilities. We first construct a billion-scale candidate region dataset and propose a Region Transformer layer to extract rich regional semantics. We further design a unified region cluster discrimination loss that jointly supports object and OCR learning within a single classification framework, enabling efficient and scalable distributed training on large-scale data. Extensive experiments show that RICE consistently outperforms previous methods on segmentation, dense detection, and visual perception for Multimodal Large Language Models (MLLMs). The pre-trained models have been released at https://github.com/deepglint/MVT.
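
To make the idea of a cluster discrimination loss inside a single classification framework more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each region embedding is scored against a set of cluster centers by scaled cosine similarity and trained with softmax cross-entropy against its assigned cluster(s). The function names, the `scale` value, and the choice to let a region carry several positive clusters (as a hypothetical way for text regions to share the classifier with object regions) are all illustrative assumptions that the abstract does not specify.

```python
import math


def l2_normalize(v):
    """Return v scaled to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cluster_logits(region_emb, centers, scale=10.0):
    """Scaled cosine similarity between one region embedding and every center."""
    r = l2_normalize(region_emb)
    return [scale * sum(a * b for a, b in zip(r, l2_normalize(c))) for c in centers]


def cluster_discrimination_loss(region_embs, centers, positives, scale=10.0):
    """Mean softmax cross-entropy, treating each cluster center as a class.

    positives[i] is the set of cluster ids assigned to region i. One id per
    region gives ordinary single-label classification; allowing several ids
    is an assumption made here purely for illustration.
    """
    total = 0.0
    for emb, pos in zip(region_embs, positives):
        logits = cluster_logits(emb, centers, scale)
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Average the negative log-likelihood over this region's positives.
        total += sum(log_z - logits[k] for k in pos) / len(pos)
    return total / len(region_embs)


# Toy example: three hypothetical cluster centers and two region embeddings.
centers = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
regions = [[0.9, 0.1], [0.1, 0.9]]
matched = cluster_discrimination_loss(regions, centers, [{0}, {1}])
mismatched = cluster_discrimination_loss(regions, centers, [{1}, {0}])
print(matched < mismatched)
```

In this toy setting the loss is lower when each region is assigned to its nearest center than when the assignments are swapped, which is the behavior a classification-style objective over clusters is meant to encourage.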

