γ-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models
October 17, 2024
Authors: Yaxin Luo, Gen Luo, Jiayi Ji, Yiyi Zhou, Xiaoshuai Sun, Zhiqiang Shen, Rongrong Ji
cs.AI
Abstract
Despite the significant progress in multimodal large language models (MLLMs),
their high computational cost remains a barrier to real-world deployment.
Inspired by the mixture of depths (MoDs) in natural language processing, we aim
to address this limitation from the perspective of "activated tokens". Our
key insight is that if most tokens are redundant for the layer computation,
then they can be skipped directly via the MoD layer. However, directly converting
the dense layers of MLLMs to MoD layers leads to substantial performance
degradation. To address this issue, we propose an innovative MoD adaptation
strategy for existing MLLMs called gamma-MoD. In gamma-MoD, a novel
metric is proposed to guide the deployment of MoDs in the MLLM, namely rank of
attention maps (ARank). Through ARank, we can effectively identify which layer
is redundant and should be replaced with the MoD layer. Based on ARank, we
further propose two novel designs to maximize the computational sparsity of
MLLM while maintaining its performance, namely shared vision-language router
and masked routing learning. With these designs, more than 90% of the dense layers of
the MLLM can be effectively converted to the MoD ones. To validate our method,
we apply it to three popular MLLMs and conduct extensive experiments on 9
benchmark datasets. Experimental results not only validate the significant
efficiency benefits of gamma-MoD over existing MLLMs but also confirm its
generalization ability on various MLLMs. For example, with a minor performance
drop, i.e., -1.5%, gamma-MoD can reduce the training and inference time of
LLaVA-HR by 31.0% and 53.2%, respectively.
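The abstract proposes the rank of attention maps (ARank) as the metric for deciding which dense layers are redundant enough to convert to MoD layers. A minimal sketch of how such a metric could be computed is below; the tolerance value and the averaging over heads are illustrative assumptions, not the paper's exact definition.

```python
import torch


def arank(attn: torch.Tensor, tol: float = 1e-3) -> float:
    """Average numerical rank of a layer's attention maps.

    attn: (heads, seq, seq) softmax attention weights for one layer.
    Intuitively, a low average rank means the attention output carries
    little unique information per token, marking the layer as a
    candidate for MoD conversion.
    """
    # matrix_rank counts singular values above the tolerance, per head
    ranks = torch.linalg.matrix_rank(attn, atol=tol)
    return ranks.float().mean().item()


# A uniform attention map is rank-1 (maximally redundant); an identity
# attention map is full-rank (every token attends only to itself).
uniform = torch.full((2, 8, 8), 1.0 / 8)
identity = torch.eye(8).expand(2, 8, 8)
```

Under this sketch, `arank(uniform)` evaluates to 1.0 and `arank(identity)` to 8.0, matching the intuition that layers with low-rank attention contribute little per-token computation.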
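An MoD layer reduces computation by routing only a fraction of tokens through the wrapped block while the rest take the residual path. The following hypothetical PyTorch sketch shows this top-k token routing; the capacity ratio, the sigmoid score scaling, and the per-example loop are simplifying choices for illustration, not the paper's shared vision-language router or masked routing learning.

```python
import torch
import torch.nn as nn


class MoDLayer(nn.Module):
    """Minimal mixture-of-depths wrapper (a sketch, not the paper's code).

    A scalar router scores every token; only the top-k fraction
    (``capacity``) is processed by the wrapped block, and all other
    tokens skip the block entirely via the residual path.
    """

    def __init__(self, block: nn.Module, dim: int, capacity: float = 0.3):
        super().__init__()
        self.block = block
        self.router = nn.Linear(dim, 1)  # token scorer; a shared router would score vision and text tokens jointly
        self.capacity = capacity

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim)
        scores = self.router(x).squeeze(-1)      # (batch, seq)
        k = max(1, int(x.size(1) * self.capacity))
        topk = scores.topk(k, dim=-1).indices    # indices of tokens to process
        out = x.clone()                          # skipped tokens pass through unchanged
        for b in range(x.size(0)):
            idx = topk[b]
            sel = x[b, idx].unsqueeze(0)         # (1, k, dim)
            # weight the block output by the router score so the
            # routing decision stays differentiable during training
            w = torch.sigmoid(scores[b, idx]).unsqueeze(-1)
            out[b, idx] = self.block(sel).squeeze(0) * w + x[b, idx] * (1 - w)
        return out
```

With `capacity=0.25` and a sequence of 8 tokens, only 2 tokens per example pass through the block, which is the source of the training- and inference-time savings the abstract reports.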