ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements

Abstract

Recent advances in foundational Vision Language Models (VLMs) have reshaped the evaluation paradigm in computer vision tasks. These foundational models, especially CLIP, have accelerated research in open-vocabulary computer vision tasks, including Open-Vocabulary Semantic Segmentation (OVSS). Although the initial results are promising, the dense prediction capabilities of VLMs still require further improvement. In this study, we enhance the semantic segmentation performance of CLIP by introducing new modules and modifications: 1) architectural changes in the last layer of ViT and the combination of attention maps from the middle layers with those of the last layer, 2) Image Engineering: applying data augmentations to enrich input image representations, and 3) using Large Language Models (LLMs) to generate definitions and synonyms for each class name to leverage CLIP's open-vocabulary capabilities. Our training-free method, ITACLIP, outperforms current state-of-the-art approaches on segmentation benchmarks such as COCO-Stuff, COCO-Object, Pascal Context, and Pascal VOC. Our code is available at https://github.com/m-arda-aydn/ITACLIP.
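
To make points 2) and 3) of the abstract more concrete, the following is a minimal sketch, not the authors' implementation (see the linked repository for that). It assumes that patch features from several augmented views have already been mapped back onto the original patch grid, and that each class has one name embedding plus a few auxiliary embeddings for LLM-generated synonyms or definitions. The function names, the blending weight `alpha`, and the synthetic arrays standing in for CLIP encoder outputs are all hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def class_text_embedding(name_emb, aux_embs, alpha=0.8):
    """Blend a class-name embedding with the mean of auxiliary embeddings.

    aux_embs stands in for embeddings of LLM-generated synonyms or
    definitions. alpha is a hypothetical weight, not a value from the paper.
    """
    name_emb = l2_normalize(np.asarray(name_emb, dtype=float))
    if len(aux_embs) == 0:
        return name_emb
    aux = l2_normalize(np.asarray(aux_embs, dtype=float))
    aux_mean = l2_normalize(aux.mean(axis=0))
    return l2_normalize(alpha * name_emb + (1.0 - alpha) * aux_mean)


def segment(patch_feats_per_view, text_embs, grid_hw):
    """Average patch-to-text cosine similarities over views, then take argmax.

    patch_feats_per_view: list of (N, D) arrays, one per augmented view,
    assumed to be aligned to the same N = H * W patch grid.
    text_embs: (C, D) array with one embedding per class.
    """
    text_embs = l2_normalize(np.asarray(text_embs, dtype=float))
    per_view = [l2_normalize(f) @ text_embs.T for f in patch_feats_per_view]
    logits = np.mean(per_view, axis=0)  # shape (N, C)
    return logits.argmax(axis=-1).reshape(grid_hw)


# Synthetic demo: 3 classes, 16-dim features, a 2 x 3 patch grid.
rng = np.random.default_rng(0)
dim, num_classes, grid_hw = 16, 3, (2, 3)
labels = np.array([0, 1, 2, 2, 1, 0])  # planted ground truth per patch

name_embs = np.eye(num_classes, dim)  # stand-in for class-name embeddings
text_embs = np.stack([
    class_text_embedding(
        name_embs[c],
        [name_embs[c] + rng.uniform(-0.1, 0.1, dim) for _ in range(2)],
    )
    for c in range(num_classes)
])

# Two "augmented views": the same planted features with small bounded noise.
views = [
    name_embs[labels] + rng.uniform(-0.1, 0.1, (labels.size, dim))
    for _ in range(2)
]

seg = segment(views, text_embs, grid_hw)
print(seg)
```

In this toy setup the noise is small enough that the predicted map matches the planted labels. With a real model, the synthetic arrays would be replaced by CLIP text and patch features, and the per-view alignment step (undoing flips or other augmentations) would have to be handled before averaging.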
