p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay

Abstract

Despite the remarkable performance of multimodal large language models (MLLMs) across diverse tasks, their substantial training and inference costs impede further progress. Most of the computation stems from the overwhelming volume of vision tokens processed by the transformer decoder. In this paper, we propose to build efficient MLLMs by leveraging the Mixture-of-Depths (MoD) mechanism, in which each transformer decoder layer selects the essential vision tokens to process and skips the redundant ones. However, integrating MoD into MLLMs is non-trivial. To address the challenges of training and inference stability and of limited training data, we adapt the MoD module with two novel designs: tanh-gated weight normalization (TanhNorm) and symmetric token reweighting (STRing). Moreover, we observe that vision tokens exhibit higher redundancy in deeper layers, and we therefore design a progressive ratio decay (PRD) strategy, which gradually reduces the token retention ratio layer by layer following a shifted cosine schedule. This crucial design fully unleashes the potential of MoD, significantly boosting the efficiency and performance of our models. To validate the effectiveness of our approach, we conduct extensive experiments with two baseline models across 14 benchmarks. Our model, p-MoD, matches or even surpasses the performance of the baseline models while using only 55.6% of the TFLOPs and 53.8% of the KV cache storage during inference, and 77.7% of the GPU hours during training.
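To make the idea of a layer-wise decaying retention ratio concrete, the following is a minimal sketch of one possible shifted cosine schedule. The abstract does not state the exact formula, so the function names, the `shift` and `floor` parameters, and the clipping to [floor, 1] are all assumptions made for illustration, not the authors' implementation.

```python
import math


def shifted_cosine_ratios(num_layers, shift=0.0, floor=0.0):
    """Return one token retention ratio per decoder layer.

    Assumed form (not taken from the paper): a half-cosine that decays
    from 1 at the first layer to 0 at the last layer, plus a constant
    `shift`, clipped to the range [floor, 1].
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    ratios = []
    for layer in range(num_layers):
        # Progress through the stack, from 0.0 (first layer) to 1.0 (last).
        progress = layer / (num_layers - 1) if num_layers > 1 else 0.0
        base = 0.5 * (1.0 + math.cos(math.pi * progress))
        ratios.append(min(1.0, max(floor, base + shift)))
    return ratios


def tokens_kept_per_layer(num_vision_tokens, ratios):
    """Convert retention ratios into per-layer vision-token counts."""
    return [int(round(num_vision_tokens * r)) for r in ratios]


if __name__ == "__main__":
    # Hypothetical setting: 8 decoder layers and 576 vision tokens.
    ratios = shifted_cosine_ratios(num_layers=8, shift=0.1, floor=0.05)
    for layer, kept in enumerate(tokens_kept_per_layer(576, ratios)):
        print(f"layer {layer}: ratio={ratios[layer]:.3f}, tokens kept={kept}")
```

In this sketch, early layers keep nearly all vision tokens and deeper layers keep progressively fewer, which mirrors the observation that redundancy grows with depth. The `shift` value would be chosen so that the average ratio across layers meets a target compute budget.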
