CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling
September 28, 2024
Authors: Jihai Zhang, Xiaoye Qu, Tong Zhu, Yu Cheng
cs.AI
Abstract
In recent years, Contrastive Language-Image Pre-training (CLIP) has become a
cornerstone in multimodal intelligence. However, recent studies have identified
that the information loss in the CLIP encoding process is substantial, and CLIP
tends to capture only coarse-grained features from the input. This deficiency
significantly limits the ability of a single CLIP model to handle images rich
in visual detail. In this work, we propose a simple yet effective
model-agnostic strategy, Diversified Multiplet Upcycling (DMU), for CLIP. DMU
efficiently fine-tunes, from a dense pre-trained CLIP checkpoint, a series of
CLIP models that capture different feature spaces while sharing all parameters
except the Feed-Forward Network (FFN). These models can then be transformed into a
CLIP-MoE with a larger model capacity, leading to significantly enhanced
performance with minimal computational overhead. To the best of our knowledge,
Diversified Multiplet Upcycling is the first approach to introduce sparsely
activated MoE into CLIP foundation models. Extensive experiments demonstrate
the strong performance of CLIP-MoE on various zero-shot retrieval and
zero-shot image classification tasks, as well as on downstream Multimodal
Large Language Model (MLLM) benchmarks where it serves as the vision encoder.
Furthermore,
Diversified Multiplet Upcycling enables the conversion of any dense CLIP model
into CLIP-MoEs, which can seamlessly replace CLIP in a plug-and-play manner
without requiring further adaptation in downstream frameworks. Through
Diversified Multiplet Upcycling, we aim to provide valuable insights for future
research on developing more efficient and effective multimodal learning
systems.
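
The construction described in the abstract can be made concrete with a small sketch: the FFNs from the fine-tuned copies become the experts of a sparsely activated MoE layer, while all other parameters stay shared. The PyTorch code below is a minimal illustration under assumed details; the class names (FFN, MoEFFN, upcycle_ffn), the linear router, and top-2 routing are hypothetical choices, since the abstract does not specify the paper's actual gating or fine-tuning procedure.

```python
# Minimal sketch of upcycling a dense CLIP FFN into a sparse MoE layer.
# All names and the top-2 routing scheme are illustrative assumptions,
# not the paper's actual API or configuration.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFN(nn.Module):
    """Standard transformer feed-forward block, as used in CLIP layers."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(F.gelu(self.fc1(x)))

class MoEFFN(nn.Module):
    """Sparse MoE layer whose experts are FFNs from the fine-tuned copies."""
    def __init__(self, experts: list, d_model: int, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(experts)
        self.router = nn.Linear(d_model, len(experts))  # learned gating
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model); each token is routed to its top-k experts.
        logits = self.router(x)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

def upcycle_ffn(dense_ffn: FFN, num_experts: int, d_model: int) -> MoEFFN:
    """Initialize every expert from the dense checkpoint's FFN; in DMU each
    copy would then be fine-tuned to capture a different feature space."""
    experts = [copy.deepcopy(dense_ffn) for _ in range(num_experts)]
    return MoEFFN(experts, d_model, top_k=2)
```

Because the copies differ only in their FFNs, upcycling multiplies expert capacity without duplicating attention or embedding weights, and top-k routing activates only a fixed number of experts per token, which is consistent with the abstract's claim of enlarged model capacity at minimal computational overhead.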