ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers
April 1, 2025
Authors: Qianhao Yuan, Qingyu Zhang, Yanjiang Liu, Jiawei Chen, Yaojie Lu, Hongyu Lin, Jia Zheng, Xianpei Han, Le Sun
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) suffer from high computational costs
due to their massive size and the large number of visual tokens. In this paper,
we investigate layer-wise redundancy in MLLMs by introducing a novel metric,
Layer Contribution (LC), which quantifies the impact of a layer's
transformations on visual and text tokens, respectively. The calculation of LC
involves measuring the divergence in model output that results from removing
the layer's transformations on the specified tokens. Our pilot experiment
reveals that many layers of MLLMs exhibit minimal contribution during the
processing of visual tokens. Motivated by this observation, we propose ShortV,
a training-free method that leverages LC to identify ineffective layers, and
freezes visual token updates in these layers. Experiments show that ShortV can
freeze visual tokens in approximately 60% of the MLLM layers, thereby
dramatically reducing computational costs related to updating visual tokens.
For example, it achieves a 50% reduction in FLOPs on LLaVA-NeXT-13B while
maintaining superior performance. The code will be publicly available at
https://github.com/icip-cas/ShortV.
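
Below is a minimal sketch, not the authors' released implementation, illustrating the two ideas stated in the abstract: estimating a layer's contribution (LC) as the divergence in model output caused by removing that layer's transformation of a chosen token subset, and "freezing" visual tokens in an ineffective layer by passing their hidden states through unchanged. The names `FrozenVisualLayer`, `run_model`, `visual_mask`, and the KL-based divergence are illustrative assumptions, not details taken from the paper.

```python
# Sketch only: assumes a decoder-style MLLM whose layers map hidden states
# of shape (batch, seq_len, hidden) to the same shape, and a boolean
# visual_mask of shape (batch, seq_len) marking visual-token positions.
import torch
import torch.nn.functional as F


class FrozenVisualLayer(torch.nn.Module):
    """Wraps a decoder layer so visual tokens skip its update (hypothetical wrapper)."""

    def __init__(self, layer):
        super().__init__()
        self.layer = layer

    def forward(self, hidden_states, visual_mask, **kwargs):
        out = self.layer(hidden_states, **kwargs)
        out = out[0] if isinstance(out, tuple) else out
        # Restore pre-layer hidden states at visual positions, i.e. freeze them.
        mask = visual_mask.unsqueeze(-1)  # (batch, seq_len, 1)
        return torch.where(mask, hidden_states, out)


@torch.no_grad()
def layer_contribution(run_model, layer_idx, inputs, token_mask):
    """Estimate LC for one layer as the divergence between the original output
    distribution and the output when the layer's transformation of the masked
    tokens is removed. `run_model(inputs, skip_layer, skip_mask)` is a
    hypothetical helper returning next-token logits."""
    logits_full = run_model(inputs, skip_layer=None, skip_mask=None)
    logits_skip = run_model(inputs, skip_layer=layer_idx, skip_mask=token_mask)
    log_p = F.log_softmax(logits_full, dim=-1)
    log_q = F.log_softmax(logits_skip, dim=-1)
    # KL(full || skipped), one possible choice of output divergence.
    return F.kl_div(log_q, log_p, log_target=True, reduction="batchmean").item()
```

Under this sketch, layers whose LC for visual tokens falls below a threshold would be wrapped with `FrozenVisualLayer`, so their attention and MLP updates are applied only to text tokens while visual-token hidden states are carried forward unchanged.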