ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers

Abstract

Multimodal Large Language Models (MLLMs) suffer from high computational costs due to their massive size and the large number of visual tokens they process. In this paper, we investigate layer-wise redundancy in MLLMs by introducing a novel metric, Layer Contribution (LC), which quantifies the impact of a layer's transformations on visual tokens and on text tokens separately. LC is computed by measuring the divergence in model output that results from removing the layer's transformations on the specified tokens. Our pilot experiment reveals that many layers of MLLMs contribute minimally when processing visual tokens. Motivated by this observation, we propose ShortV, a training-free method that leverages LC to identify ineffective layers and freezes visual token updates in these layers. Experiments show that ShortV can freeze visual token updates in approximately 60% of the MLLM layers, thereby dramatically reducing the computational costs of updating visual tokens. For example, it achieves a 50% reduction in FLOPs on LLaVA-NeXT-13B while maintaining superior performance. The code will be publicly available at https://github.com/icip-cas/ShortV.
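To make the two ideas in the abstract concrete, the following is a minimal sketch on a toy model, not the authors' implementation. It assumes a small residual stack in NumPy as a stand-in for an MLLM, uses KL divergence as the divergence measure (the abstract says only "divergence"), and picks the layers to freeze by taking the lowest LC scores. All function names, the selection rule, and the toy dimensions are hypothetical. Unlike a real implementation, the toy still computes the updates it discards, so it illustrates the behavior rather than the compute savings.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def layer_update(h, w):
    """Toy stand-in for a transformer block: attention-like token mixing, then a projection."""
    attn = softmax(h @ h.T / np.sqrt(h.shape[1]))
    return np.tanh((attn @ h) @ w)


def forward(h, layers, w_out, frozen=None):
    """Run the toy residual stack.

    `frozen` maps a layer index to a boolean token mask. Masked tokens skip that
    layer's update but remain visible as context to the other tokens (an
    assumption of this sketch).
    """
    frozen = frozen or {}
    for i, w in enumerate(layers):
        delta = layer_update(h, w)
        if i in frozen:
            # Remove the layer's transformation for the selected tokens only.
            delta = np.where(frozen[i][:, None], 0.0, delta)
        h = h + delta
    return softmax(h[-1] @ w_out)  # output distribution read from the last token


def kl_divergence(p, q, eps=1e-12):
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))


def layer_contribution(h, layers, w_out, token_mask):
    """LC per layer: divergence between the full model's output and the output
    when that layer's update is removed for the tokens in `token_mask`."""
    base = forward(h, layers, w_out)
    return [
        kl_divergence(base, forward(h, layers, w_out, {i: token_mask}))
        for i in range(len(layers))
    ]


def select_layers(lc_scores, fraction=0.6):
    """Indices of the lowest-LC layers (hypothetical selection rule)."""
    k = int(round(fraction * len(lc_scores)))
    return sorted(int(i) for i in np.argsort(lc_scores)[:k])


# Toy setup: 10 layers, 8 "visual" tokens followed by 4 "text" tokens.
rng = np.random.default_rng(0)
num_layers, dim, vocab = 10, 16, 32
layers = [rng.normal(scale=0.2, size=(dim, dim)) for _ in range(num_layers)]
w_out = rng.normal(size=(dim, vocab))
h0 = rng.normal(size=(12, dim))
is_visual = np.arange(12) < 8

lc_visual = layer_contribution(h0, layers, w_out, is_visual)
frozen_layers = select_layers(lc_visual, fraction=0.6)

p_full = forward(h0, layers, w_out)
p_short = forward(h0, layers, w_out, {i: is_visual for i in frozen_layers})

print("layers with frozen visual tokens:", frozen_layers)
print("output divergence after freezing:", kl_divergence(p_full, p_short))
```

Passing the text-token mask (`~is_visual`) to `layer_contribution` gives the corresponding scores for text tokens, which is how the two token types can be compared layer by layer in this sketch.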
