
CosmoCLIP: Generalizing Large Vision-Language Models for Astronomical Imaging

July 10, 2024
Authors: Raza Imam, Mohammed Talha Alam, Umaima Rahman, Mohsen Guizani, Fakhri Karray
cs.AI

Abstract

Existing vision-text contrastive learning models enhance representation transferability and support zero-shot prediction by matching paired image and caption embeddings while pushing unrelated pairs apart. However, astronomical image-label datasets are significantly smaller than the general image and label datasets available from the internet. We introduce CosmoCLIP, an astronomical image-text contrastive learning framework fine-tuned on the pre-trained CLIP model using SpaceNet and BLIP-based captions. SpaceNet, obtained via FLARE, comprises ~13k optimally distributed images, while BLIP acts as a rich knowledge extractor. The rich semantics derived from SpaceNet images and BLIP descriptions, when learned contrastively, enable CosmoCLIP to achieve superior generalization across various in-domain and out-of-domain tasks. Our results demonstrate that CosmoCLIP is a straightforward yet powerful framework, significantly outperforming CLIP in zero-shot classification and image-text retrieval tasks.
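
A minimal sketch of the recipe the abstract describes (not the authors' released code): use BLIP to caption astronomy images, then fine-tune a pre-trained CLIP model contrastively on the resulting image-caption pairs. The Hugging Face checkpoint names are illustrative stand-ins, and the image batch is assumed to come from a SpaceNet-like dataset.

```python
# Hedged sketch of CLIP fine-tuning on BLIP-generated captions.
# Checkpoints and the image source are illustrative assumptions, not the paper's exact setup.
import torch
from PIL import Image
from transformers import (BlipProcessor, BlipForConditionalGeneration,
                          CLIPProcessor, CLIPModel)

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) BLIP as a knowledge extractor: generate a caption for each image.
blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base").to(device)

def caption(image: Image.Image) -> str:
    inputs = blip_proc(images=image, return_tensors="pt").to(device)
    out = blip.generate(**inputs, max_new_tokens=30)
    return blip_proc.decode(out[0], skip_special_tokens=True)

# 2) Contrastive fine-tuning of pre-trained CLIP on image-caption pairs.
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
optimizer = torch.optim.AdamW(clip.parameters(), lr=1e-6)

def train_step(images: list) -> float:
    texts = [caption(im) for im in images]  # BLIP captions as text supervision
    batch = clip_proc(text=texts, images=images, return_tensors="pt",
                      padding=True, truncation=True).to(device)
    # return_loss=True computes CLIP's symmetric contrastive loss, which pulls
    # matched image/text embeddings together and pushes mismatched pairs apart.
    loss = clip(**batch, return_loss=True).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

After fine-tuning, the adapted CLIP encoders can be used directly for zero-shot classification (comparing image embeddings against class-name prompts) and image-text retrieval, the two evaluation settings reported above.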
