Going Denser with Open-Vocabulary Part Segmentation

Abstract

Object detection has expanded from a limited number of categories to open vocabulary. Moving forward, a complete intelligent vision system requires understanding finer-grained object descriptions, namely object parts. In this paper, we propose a detector that can predict both open-vocabulary objects and their part segmentation. This ability comes from two designs. First, we train the detector jointly on part-level, object-level, and image-level data to build multi-granularity alignment between language and image. Second, we parse a novel object into its parts through its dense semantic correspondence with a base object. These two designs enable the detector to benefit substantially from various data sources and foundation models. In open-vocabulary part segmentation experiments, our method outperforms the baseline by 3.3 to 7.3 mAP in cross-dataset generalization on PartImageNet, and improves the baseline by 7.3 novel AP_{50} in cross-category generalization on Pascal Part. Finally, we train a detector that generalizes to a wide range of part segmentation datasets while achieving better performance than dataset-specific training.
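The second design, parsing a novel object through dense semantic correspondence with a base object, can be illustrated with a toy sketch. The abstract does not specify how the correspondence is computed, so everything below is an assumption made for illustration: each location is given a hypothetical feature vector, and each novel-object location simply inherits the part label of its most similar base-object location under cosine similarity. The names `cosine` and `transfer_part_labels` are not from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def transfer_part_labels(novel_feats, base_feats, base_labels):
    """Assign each novel-object location the part label of its nearest
    base-object location in feature space (illustrative assumption only)."""
    labels = []
    for feat in novel_feats:
        best = max(range(len(base_feats)),
                   key=lambda i: cosine(feat, base_feats[i]))
        labels.append(base_labels[best])
    return labels


# Hypothetical 2-D features: a base object with annotated parts and a
# novel object that has no part annotations of its own.
base_feats = [[1.0, 0.0], [0.0, 1.0]]
base_labels = ["head", "leg"]
novel_feats = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]

novel_labels = transfer_part_labels(novel_feats, base_feats, base_labels)
print(novel_labels)
```

In a real system the features would presumably come from a foundation model and the matching would run over dense feature maps; the toy vectors here only show the label-transfer idea.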
