CLIPSym: Delving into Symmetry Detection with CLIP

Abstract

Symmetry is one of the most fundamental geometric cues in computer vision, and detecting it remains an ongoing challenge. With the recent advances in vision-language models such as CLIP, we investigate whether a pre-trained CLIP model can aid symmetry detection by leveraging the additional symmetry cues found in natural image descriptions. We propose CLIPSym, which leverages CLIP's image and language encoders and a rotation-equivariant decoder based on a hybrid of Transformer and G-Convolution to detect rotation and reflection symmetries. To fully utilize CLIP's language encoder, we develop a novel prompting technique called Semantic-Aware Prompt Grouping (SAPG), which aggregates a diverse set of frequent object-based prompts to better integrate the semantic cues for symmetry detection. Empirically, we show that CLIPSym outperforms the current state-of-the-art on three standard symmetry detection datasets (DENDI, SDRW, and LDRS). Finally, we conduct detailed ablations verifying the benefits of CLIP's pre-training, the proposed equivariant decoder, and the SAPG technique. The code is available at https://github.com/timyoung2333/CLIPSym.
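
To make the term "rotation-equivariant" concrete, the following is a minimal sketch of a lifting group convolution over the four 90-degree rotations (the group C4), written in plain Python. It is an illustration of the property only and is not CLIPSym's decoder, whose layers the abstract does not specify. The helper names (`rot90`, `correlate`, `lift_conv`, `is_c4_equivariant`), the zero padding, and the restriction to 90-degree rotations are all assumptions made for this example.

```python
def rot90(grid):
    """Rotate a square grid 90 degrees counter-clockwise."""
    n = len(grid)
    return [[grid[j][n - 1 - i] for j in range(n)] for i in range(n)]


def correlate(image, kernel):
    """Same-size cross-correlation of a square image with an odd square kernel,
    using zero padding outside the image."""
    n, m = len(image), len(kernel)
    r = m // 2
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0
            for u in range(m):
                for v in range(m):
                    p, q = i + u - r, j + v - r
                    if 0 <= p < n and 0 <= q < n:
                        total += image[p][q] * kernel[u][v]
            out[i][j] = total
    return out


def lift_conv(image, kernel):
    """Lifting G-convolution for C4: one output channel per rotated kernel.
    Channel g holds the response to the kernel rotated g times by 90 degrees."""
    outputs = []
    k = kernel
    for _ in range(4):
        outputs.append(correlate(image, k))
        k = rot90(k)
    return outputs


def is_c4_equivariant(image, kernel):
    """Check that rotating the input rotates every output channel spatially
    and cyclically shifts the channel index by one."""
    base = lift_conv(image, kernel)
    rotated = lift_conv(rot90(image), kernel)
    return all(rotated[g] == rot90(base[(g - 1) % 4]) for g in range(4))


if __name__ == "__main__":
    # Integer-valued toy data (hypothetical) so the comparison is exact.
    image = [[(3 * i + 5 * j * j + i * j) % 7 for j in range(5)] for i in range(5)]
    kernel = [[1, 2, 0], [0, 1, -1], [3, 0, 0]]
    print(is_c4_equivariant(image, kernel))
```

The check relies on the identity that rotating both the image and the kernel rotates the correlation output, so a rotated input produces the same feature maps, rotated and re-indexed, rather than unrelated ones. A decoder built from layers with this property responds consistently to rotated inputs, which is the behavior the abstract attributes to the equivariant decoder.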
