Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
December 6, 2023
Authors: Zeyi Sun, Ye Fang, Tong Wu, Pan Zhang, Yuhang Zang, Shu Kong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang
cs.AI
Abstract
Contrastive Language-Image Pre-training (CLIP) plays an essential role in extracting valuable content information from images across diverse tasks. It aligns textual and visual modalities to comprehend the entire image, including all the details, even those irrelevant to specific tasks. However, for finer understanding and controlled editing of images, it becomes crucial to focus on specific regions of interest, which can be indicated as points, masks, or boxes by humans or perception models. To fulfill these requirements, we introduce Alpha-CLIP, an enhanced version of CLIP with an auxiliary alpha channel that suggests attentive regions, fine-tuned on millions of constructed RGBA region-text pairs. Alpha-CLIP not only preserves the visual recognition ability of CLIP but also enables precise control over the emphasis on image content. It demonstrates effectiveness in various tasks, including but not limited to open-world recognition, multimodal large language models, and conditional 2D/3D generation. It has strong potential to serve as a versatile tool for image-related tasks.
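To make the idea concrete, below is a minimal PyTorch sketch of how an auxiliary alpha channel could be grafted onto a ViT-style patch embedding: the region-of-interest map is embedded by a parallel convolution and added to the usual RGB patch embedding. The class name, dimensions, and zero-initialization choice are illustrative assumptions for exposition, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class AlphaPatchEmbed(nn.Module):
    """Sketch of an Alpha-CLIP-style patch embedding (hypothetical names).

    The RGB image goes through the standard ViT patch-embedding
    convolution, while a parallel convolution embeds the single-channel
    alpha map marking the region of interest. The two embeddings are
    summed before entering the transformer.
    """

    def __init__(self, embed_dim: int = 768, patch_size: int = 16):
        super().__init__()
        self.rgb_proj = nn.Conv2d(3, embed_dim, kernel_size=patch_size,
                                  stride=patch_size, bias=False)
        self.alpha_proj = nn.Conv2d(1, embed_dim, kernel_size=patch_size,
                                    stride=patch_size, bias=False)
        # Zero-init the alpha branch so fine-tuning starts from the
        # pretrained CLIP behavior (an assumed initialization strategy).
        nn.init.zeros_(self.alpha_proj.weight)

    def forward(self, rgb: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
        # rgb: (B, 3, H, W); alpha: (B, 1, H, W), 1.0 = focus region
        x = self.rgb_proj(rgb) + self.alpha_proj(alpha)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, embed_dim)

if __name__ == "__main__":
    embed = AlphaPatchEmbed()
    rgb = torch.randn(1, 3, 224, 224)
    alpha = torch.zeros(1, 1, 224, 224)
    alpha[:, :, 64:160, 64:160] = 1.0  # highlight a box-shaped region
    print(embed(rgb, alpha).shape)  # torch.Size([1, 196, 768])
```

Under this design, an all-zero alpha branch leaves the pretrained CLIP behavior untouched at the start of fine-tuning, which is one plausible way to preserve CLIP's visual recognition ability while learning region-focused attention from RGBA region-text pairs.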