ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers

Abstract

Multimodal Large Language Models (MLLMs) suffer from high computational costs due to their massive size and large number of visual tokens. In this paper, we investigate layer-wise redundancy in MLLMs by introducing a novel metric, Layer Contribution (LC), which quantifies the impact of a layer's transformations on visual tokens and on text tokens separately. LC is calculated by measuring the divergence in model output that results from removing the layer's transformations on the specified tokens. Our pilot experiment reveals that many MLLM layers contribute minimally when processing visual tokens. Motivated by this observation, we propose ShortV, a training-free method that uses LC to identify ineffective layers and freezes visual token updates in those layers. Experiments show that ShortV can freeze visual tokens in approximately 60% of the MLLM layers, dramatically reducing the computational cost of updating visual tokens. For example, it achieves a 50% reduction in FLOPs on LLaVA-NeXT-13B while maintaining superior performance. The code will be publicly available at https://github.com/icip-cas/ShortV.
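The LC computation and the layer selection described above can be illustrated with a toy sketch. This is not the ShortV implementation: the layer function, the use of KL divergence as the divergence measure, the choice of the last token's state as output logits, and every name below (`make_toy_layer`, `forward`, `layer_contribution`, `layers_to_freeze`) are assumptions made purely for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two strictly positive distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def make_toy_layer(dim, scale, rng):
    """Hypothetical stand-in for a transformer layer (not a real MLLM layer).

    Returns a function mapping all token states to one residual update per token.
    """
    weights = [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]

    def layer(hidden):
        # Crude substitute for attention: every token sees the mean token state.
        context = [sum(h[d] for h in hidden) / len(hidden) for d in range(dim)]
        updates = []
        for h in hidden:
            mixed = [h[d] + context[d] for d in range(dim)]
            updates.append(
                [math.tanh(sum(w * x for w, x in zip(row, mixed))) for row in weights]
            )
        return updates

    return layer


def forward(layers, hidden, frozen=None):
    """Run the toy model; `frozen` maps a layer index to token positions to freeze.

    A frozen token keeps its state unchanged in that layer, but other tokens
    can still read it as context. The last token's state serves as toy logits.
    """
    frozen = frozen or {}
    hidden = [list(h) for h in hidden]
    for index, layer in enumerate(layers):
        updates = layer(hidden)
        skip = frozen.get(index, ())
        hidden = [
            h if pos in skip else [a + b for a, b in zip(h, u)]
            for pos, (h, u) in enumerate(zip(hidden, updates))
        ]
    return softmax(hidden[-1])


def layer_contribution(layers, hidden, index, positions):
    """Divergence in output caused by removing layer `index` on `positions`."""
    full = forward(layers, hidden)
    ablated = forward(layers, hidden, {index: set(positions)})
    return kl_divergence(full, ablated)


def layers_to_freeze(layers, hidden, visual_positions, count):
    """Pick the `count` layers with the lowest contribution on visual tokens."""
    scores = [
        layer_contribution(layers, hidden, i, visual_positions)
        for i in range(len(layers))
    ]
    return sorted(range(len(layers)), key=lambda i: scores[i])[:count]


# Usage: 4 toy layers, 6 tokens, the first 4 assumed to be visual tokens.
rng = random.Random(0)
layers = [make_toy_layer(4, scale, rng) for scale in (0.5, 0.0, 0.5, 0.5)]
tokens = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(6)]
visual_positions = range(4)
chosen = layers_to_freeze(layers, tokens, visual_positions, count=2)
frozen_output = forward(layers, tokens, {i: set(visual_positions) for i in chosen})
```

In this toy setup the second layer has all-zero weights, so freezing it changes nothing and its contribution is zero. The final layer also scores zero on visual tokens, because the output is read only from the last (text) token. A real MLLM would need the scores averaged over a calibration set rather than a single random input.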
