γ-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models

Abstract

Despite significant progress in multimodal large language models (MLLMs), their high computational cost remains a barrier to real-world deployment. Inspired by mixture of depths (MoDs) in natural language processing, we aim to address this limitation from the perspective of "activated tokens". Our key insight is that if most tokens are redundant for a layer's computation, they can be skipped directly via the MoD layer. However, directly converting the dense layers of MLLMs to MoD layers leads to substantial performance degradation. To address this issue, we propose an innovative MoD adaptation strategy for existing MLLMs called γ-MoD. In γ-MoD, a novel metric, the rank of attention maps (ARank), is proposed to guide the deployment of MoDs in the MLLM. Through ARank, we can effectively identify which layers are redundant and should be replaced with MoD layers. Based on ARank, we further propose two novel designs to maximize the computational sparsity of the MLLM while maintaining its performance, namely a shared vision-language router and masked routing learning. With these designs, more than 90% of the dense layers of the MLLM can be effectively converted to MoD layers. To validate our method, we apply it to three popular MLLMs and conduct extensive experiments on 9 benchmark datasets. Experimental results not only validate the significant efficiency benefit of γ-MoD for existing MLLMs but also confirm its generalization ability across various MLLMs. For example, with a minor performance drop of 1.5%, γ-MoD can reduce the training and inference time of LLaVA-HR by 31.0% and 53.2%, respectively.
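To make the "activated tokens" idea concrete, the following is a minimal sketch of generic mixture-of-depths routing, not the γ-MoD implementation. It assumes a linear router, a fixed capacity ratio, and a plain residual update; the names `router_scores`, `mod_layer`, `capacity`, and `block` are hypothetical and do not come from the abstract. ARank-based layer selection and masked routing learning are not modeled here.

```python
from typing import Callable, List, Sequence

Token = List[float]


def router_scores(tokens: Sequence[Token], weights: Sequence[float]) -> List[float]:
    """Score each token with a linear router (dot product with `weights`)."""
    return [sum(x * w for x, w in zip(tok, weights)) for tok in tokens]


def mod_layer(
    tokens: Sequence[Token],
    weights: Sequence[float],
    block: Callable[[Token], Token],
    capacity: float,
) -> List[Token]:
    """Apply `block` only to the top-scoring tokens; the rest skip the layer."""
    k = max(1, int(len(tokens) * capacity))  # number of activated tokens
    scores = router_scores(tokens, weights)
    # Indices of the k highest-scoring tokens.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:k])
    out = []
    for i, tok in enumerate(tokens):
        if i in top:
            # Activated token: residual update x + f(x).
            out.append([x + y for x, y in zip(tok, block(tok))])
        else:
            # Skipped token: identity path, no layer computation.
            out.append(list(tok))
    return out


# Toy usage: 4 tokens of width 2, half of them activated.
tokens = [[1.0, 0.5], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5]]
weights = [1.0, 1.0]  # assumed router weights, for illustration only


def double(tok: Token) -> Token:
    """Stand-in for a transformer block."""
    return [2.0 * x for x in tok]


result = mod_layer(tokens, weights, double, capacity=0.5)
```

In this toy run, only the two highest-scoring tokens pass through `double`, and the other two are copied unchanged, which is where the compute saving comes from. A single weight vector scores every token regardless of modality; this is only loosely analogous to a router shared across vision and language tokens, and the actual router design is not specified in the abstract.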
