Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

Abstract

Contrastive Language-Image Pre-training (CLIP) plays an essential role in extracting valuable content information from images across diverse tasks. It aligns the textual and visual modalities to comprehend the entire image, including details that are irrelevant to a specific task. However, for finer understanding and controlled editing of images, it becomes crucial to focus on specific regions of interest, which humans or perception models can indicate as points, masks, or boxes. To meet this requirement, we introduce Alpha-CLIP, an enhanced version of CLIP with an auxiliary alpha channel that indicates the regions to attend to, fine-tuned on millions of constructed RGBA region-text pairs. Alpha-CLIP not only preserves the visual recognition ability of CLIP but also enables precise control over which image contents are emphasized. It demonstrates effectiveness in various tasks, including but not limited to open-world recognition, multimodal large language models, and conditional 2D/3D generation, and it has strong potential to serve as a versatile tool for image-related tasks.
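As a rough illustration of the RGBA input described in the abstract, the sketch below turns a box-shaped region of interest into an alpha map and attaches it to an RGB image. It is not the authors' code: the function names `box_to_alpha` and `attach_alpha` are hypothetical, the convention that 1.0 marks the region of interest is an assumption, and nothing here shows how the model itself consumes the extra channel.

```python
def box_to_alpha(height, width, box):
    """Return a height x width alpha map: 1.0 inside the box, 0.0 elsewhere.

    box is (top, left, bottom, right), with bottom and right exclusive.
    The 1.0 = "focus here" convention is an assumption for illustration.
    """
    top, left, bottom, right = box
    return [
        [1.0 if top <= y < bottom and left <= x < right else 0.0
         for x in range(width)]
        for y in range(height)
    ]


def attach_alpha(rgb, alpha):
    """Append the alpha value to every (r, g, b) pixel, giving RGBA pixels."""
    return [
        [(r, g, b, a) for (r, g, b), a in zip(rgb_row, alpha_row)]
        for rgb_row, alpha_row in zip(rgb, alpha)
    ]


# Toy 4x4 grey image with a 2x2 region of interest in the centre.
rgb = [[(0.5, 0.5, 0.5)] * 4 for _ in range(4)]
alpha = box_to_alpha(4, 4, (1, 1, 3, 3))
rgba = attach_alpha(rgb, alpha)
```

A mask or a point prompt could be converted to an alpha map in the same spirit; in practice one would likely use array libraries rather than nested lists.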
