CLIPSym: Delving into Symmetry Detection with CLIP

Abstract

Symmetry is one of the most fundamental geometric cues in computer vision, and detecting it has been an ongoing challenge. With the recent advances in vision-language models, particularly CLIP, we investigate whether a pre-trained CLIP model can aid symmetry detection by leveraging the additional symmetry cues found in natural image descriptions. We propose CLIPSym, which leverages CLIP's image and language encoders and a rotation-equivariant decoder based on a hybrid of Transformer and G-Convolution to detect rotation and reflection symmetries. To fully utilize CLIP's language encoder, we develop a novel prompting technique called Semantic-Aware Prompt Grouping (SAPG), which aggregates a diverse set of frequent object-based prompts to better integrate the semantic cues for symmetry detection. Empirically, we show that CLIPSym outperforms the current state-of-the-art on three standard symmetry detection datasets (DENDI, SDRW, and LDRS). Finally, we conduct detailed ablations verifying the benefits of CLIP's pre-training, the proposed equivariant decoder, and the SAPG technique. The code is available at https://github.com/timyoung2333/CLIPSym.
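
The rotation equivariance that the abstract attributes to the G-Convolution part of the decoder can be illustrated with a toy example. The sketch below is not the CLIPSym decoder; it is a minimal NumPy illustration, assuming the cyclic group C4 (rotations by multiples of 90 degrees) and a single-channel image. The helper names `corr2d` and `c4_lifting_conv` are hypothetical and do not come from the paper or its repository.

```python
import numpy as np


def corr2d(image, kernel):
    """Valid 2-D cross-correlation of a single-channel image with a kernel."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


def c4_lifting_conv(image, kernel):
    """Correlate the image with all four 90-degree rotations of the kernel.

    The result has shape (4, H', W'): one feature map per kernel orientation.
    """
    return np.stack([corr2d(image, np.rot90(kernel, r)) for r in range(4)])


rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
kernel = rng.standard_normal((3, 3))

features = c4_lifting_conv(image, kernel)
features_of_rotated = c4_lifting_conv(np.rot90(image), kernel)

# Equivariance: rotating the input by 90 degrees rotates every feature map
# by 90 degrees and cyclically shifts the orientation axis by one step.
expected = np.roll(np.rot90(features, axes=(1, 2)), shift=1, axis=0)
is_equivariant = np.allclose(features_of_rotated, expected)
print(is_equivariant)
```

Because a rotated input only permutes and rotates the feature maps, a detector built on such layers responds consistently to rotated copies of the same symmetric pattern, which is the kind of property an equivariant decoder is meant to provide.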
