SILC: Improving Vision Language Pretraining with Self-Distillation

Abstract

Image-text pretraining on web-scale image-caption datasets has become the default recipe for open-vocabulary classification and retrieval models, thanks to the success of CLIP and its variants. Several works have also used CLIP features for dense prediction tasks and have shown the emergence of open-set abilities. However, the contrastive objective focuses only on image-text alignment and does not incentivise image feature learning for dense prediction tasks. In this work, we propose SILC, which simply adds local-to-global correspondence learning by self-distillation as an additional objective for contrastive pretraining. We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on several computer vision tasks, including classification, retrieval, and especially segmentation. We further show that SILC scales better than the baselines for the same training duration. SILC sets a new state of the art for zero-shot classification, few-shot classification, image and text retrieval, zero-shot segmentation, and open-vocabulary segmentation.
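
The abstract names three ingredients: a contrastive image-text objective, a self-distillation objective that matches a student's view of a local crop to a teacher's view of the full image, and a teacher maintained as an exponential moving average of the student. The abstract does not give the exact losses, so the following is only a minimal sketch under stated assumptions: a symmetric InfoNCE loss for the contrastive term, a DINO-style cross-entropy for the distillation term, and illustrative values for the temperatures, the EMA momentum, and the loss weight. All function and parameter names are hypothetical.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_total for x in scaled]


def ema_update(teacher, student, momentum=0.99):
    """Move each teacher weight toward the matching student weight.

    The momentum value is an assumption, not taken from the abstract.
    """
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher, student)]


def distillation_loss(student_local_logits, teacher_global_logits,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy from the teacher's distribution on a global view
    to the student's distribution on a local crop (DINO-style, assumed)."""
    teacher_log_p = log_softmax(teacher_global_logits, teacher_temp)
    student_log_p = log_softmax(student_local_logits, student_temp)
    return -sum(math.exp(lt) * ls
                for lt, ls in zip(teacher_log_p, student_log_p))


def contrastive_loss(similarity, temperature=0.07):
    """Symmetric InfoNCE over an image-text similarity matrix.

    Row i is image i, column j is text j; matching pairs sit on the diagonal.
    """
    n = len(similarity)
    loss = 0.0
    for i in range(n):
        image_to_text = log_softmax(similarity[i], temperature)
        text_to_image = log_softmax(
            [similarity[j][i] for j in range(n)], temperature)
        loss -= 0.5 * (image_to_text[i] + text_to_image[i])
    return loss / n


def total_loss(similarity, student_local_logits, teacher_global_logits,
               distill_weight=1.0):
    """Contrastive term plus weighted distillation term (weight assumed)."""
    return (contrastive_loss(similarity)
            + distill_weight * distillation_loss(
                student_local_logits, teacher_global_logits))
```

In a training loop built on this sketch, the student would be updated by gradient descent on `total_loss`, and `ema_update` would then be applied to the teacher weights after each step; the teacher itself receives no gradients.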
