Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

Abstract

Contrastive Language-Image Pre-training (CLIP) plays an essential role in extracting valuable content information from images across diverse tasks. It aligns the textual and visual modalities to comprehend the entire image, including details that are irrelevant to a specific task. However, for finer understanding and controlled editing of images, it is crucial to focus on specific regions of interest, which humans or perception models can indicate as points, masks, or boxes. To meet this need, we introduce Alpha-CLIP, an enhanced version of CLIP with an auxiliary alpha channel that suggests attentive regions, fine-tuned on millions of constructed RGBA region-text pairs. Alpha-CLIP not only preserves the visual recognition ability of CLIP but also enables precise control over which image contents are emphasized. It demonstrates effectiveness in various tasks, including but not limited to open-world recognition, multimodal large language models, and conditional 2D/3D generation, and it has strong potential to serve as a versatile tool for image-related tasks.
