CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor

Abstract

Existing open-vocabulary image segmentation methods require a fine-tuning step on mask annotations, image-text datasets, or both. Because mask labeling is labor-intensive, segmentation datasets cover only a limited number of categories. As a result, the open-vocabulary capacity of pre-trained vision-language models (VLMs) is severely reduced after fine-tuning. Without fine-tuning, however, VLMs trained under weak image-text supervision tend to produce suboptimal masks when some text queries refer to concepts that do not appear in the image. To alleviate these issues, we introduce a novel recurrent framework that progressively filters out irrelevant texts and improves mask quality without any training. The recurrent unit is a two-stage segmenter built on a VLM with frozen weights. Our model thus retains the VLM's broad vocabulary space while strengthening its segmentation capability. Experiments show that our method outperforms not only training-free counterparts but also models fine-tuned with millions of additional data samples, setting new state-of-the-art records on both zero-shot semantic segmentation and referring image segmentation. Specifically, we improve the current records by 28.8, 16.0, and 6.9 mIoU on Pascal VOC, COCO Object, and Pascal Context, respectively.
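The recurrent filtering idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function recurrent_filter, the toy scorer toy_segment, the similarity values in TOY_IMAGE, and the threshold of 0.75 are all hypothetical stand-ins chosen for illustration. The only properties taken from the abstract are that the segmenter is frozen (no weights are updated) and that irrelevant text queries are dropped progressively over repeated passes.

```python
from typing import Callable, Dict, List

# Hypothetical interface: a frozen segmenter scores each active text query
# against the image. A real system would also return masks; this sketch
# only tracks per-query confidence.
Segmenter = Callable[[Dict[str, float], List[str]], Dict[str, float]]


def recurrent_filter(
    image: Dict[str, float],
    queries: List[str],
    segment: Segmenter,
    threshold: float = 0.75,
    max_steps: int = 10,
) -> List[str]:
    """Re-run a frozen segmenter, dropping low-scoring queries each round.

    Stops when a round removes nothing, when no queries remain, or after
    max_steps rounds. No parameters are trained at any point.
    """
    active = list(queries)
    for _ in range(max_steps):
        if not active:
            break
        scores = segment(image, active)
        kept = [q for q in active if scores[q] >= threshold]
        if kept == active:  # fixed point: the query set is stable
            break
        active = kept
    return active


# Toy stand-in for a frozen VLM: the "image" is a table of made-up raw
# similarities, and each score is normalized by the mean over the active
# queries, so removing distractors changes the remaining scores.
TOY_IMAGE = {"cat": 0.9, "grass": 0.8, "sofa": 0.5, "unicorn": 0.1, "spaceship": 0.1}


def toy_segment(image: Dict[str, float], active: List[str]) -> Dict[str, float]:
    mean_sim = sum(image[q] for q in active) / len(active)
    return {q: image[q] / mean_sim for q in active}


all_queries = ["cat", "grass", "sofa", "unicorn", "spaceship"]
print(recurrent_filter(TOY_IMAGE, all_queries, toy_segment))  # prints ['cat', 'grass']
```

In this toy run the first pass removes "unicorn" and "spaceship", the second pass removes "sofa" once the scores are renormalized over the surviving queries, and the third pass changes nothing, so the loop stops. The point of the sketch is only that the output of one pass becomes the query set of the next.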

