Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
December 6, 2023
Authors: Zeyi Sun, Ye Fang, Tong Wu, Pan Zhang, Yuhang Zang, Shu Kong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang
cs.AI
Abstract
Contrastive Language-Image Pre-training (CLIP) plays an essential role in extracting valuable content information from images across diverse tasks. It aligns textual and visual modalities to comprehend the entire image, including all the details, even those irrelevant to specific tasks. However, for finer understanding and controlled editing of images, it becomes crucial to focus on specific regions of interest, which can be indicated as points, masks, or boxes by humans or perception models. To fulfill these requirements, we introduce Alpha-CLIP, an enhanced version of CLIP with an auxiliary alpha channel that suggests attentive regions, fine-tuned on millions of constructed RGBA region-text pairs. Alpha-CLIP not only preserves the visual recognition ability of CLIP but also enables precise control over the emphasis on image content. It demonstrates effectiveness in various tasks, including but not limited to open-world recognition, multimodal large language models, and conditional 2D/3D generation. It has strong potential to serve as a versatile tool for image-related tasks.
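To make the idea concrete, below is a minimal PyTorch sketch of how an auxiliary alpha channel could be grafted onto a ViT-style patch embedding: the region-of-interest map is embedded by a parallel convolution and added to the usual RGB patch embedding. The class name, dimensions, and zero-initialization choice are illustrative assumptions for exposition, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class AlphaPatchEmbed(nn.Module):
    """Sketch of an Alpha-CLIP-style patch embedding (hypothetical names).

    The RGB image goes through the standard ViT patch-embedding
    convolution, while a parallel convolution embeds the single-channel
    alpha map marking the region of interest. The two embeddings are
    summed before entering the transformer.
    """

    def __init__(self, embed_dim: int = 768, patch_size: int = 16):
        super().__init__()
        self.rgb_proj = nn.Conv2d(3, embed_dim, kernel_size=patch_size,
                                  stride=patch_size, bias=False)
        self.alpha_proj = nn.Conv2d(1, embed_dim, kernel_size=patch_size,
                                    stride=patch_size, bias=False)
        # Zero-init the alpha branch so fine-tuning starts from the
        # pretrained CLIP behavior (an assumed initialization strategy).
        nn.init.zeros_(self.alpha_proj.weight)

    def forward(self, rgb: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
        # rgb: (B, 3, H, W); alpha: (B, 1, H, W), 1.0 = focus region
        x = self.rgb_proj(rgb) + self.alpha_proj(alpha)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, embed_dim)

if __name__ == "__main__":
    embed = AlphaPatchEmbed()
    rgb = torch.randn(1, 3, 224, 224)
    alpha = torch.zeros(1, 1, 224, 224)
    alpha[:, :, 64:160, 64:160] = 1.0  # highlight a box-shaped region
    print(embed(rgb, alpha).shape)  # torch.Size([1, 196, 768])
```

Under this design, an all-zero alpha branch leaves the pretrained CLIP behavior untouched at the start of fine-tuning, which is one plausible way to preserve CLIP's visual recognition ability while learning region-focused attention from RGBA region-text pairs.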