SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs
June 5, 2025
Authors: Jiahui Wang, Zuyan Liu, Yongming Rao, Jiwen Lu
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) are commonly derived by extending pre-trained Large Language Models (LLMs) with visual capabilities. In this work, we investigate how MLLMs process visual inputs by analyzing their attention mechanisms. We reveal a surprising sparsity phenomenon: only a small subset (less than 5%) of attention heads in LLMs actively contribute to visual understanding; we term these visual heads. To identify these heads efficiently, we design a training-free framework that quantifies head-level visual relevance through targeted response analysis. Building on this discovery, we introduce SparseMM, a KV-Cache optimization strategy that allocates asymmetric computation budgets to heads in LLMs based on their visual scores, leveraging the sparsity of visual heads to accelerate MLLM inference. Unlike prior KV-Cache acceleration methods that ignore the particularity of visual information, SparseMM prioritizes preserving and retaining visual semantics during decoding. Extensive evaluations across mainstream multimodal benchmarks demonstrate that SparseMM achieves superior accuracy-efficiency trade-offs. Notably, SparseMM delivers 1.38x real-time acceleration and 52% memory reduction during generation while maintaining performance parity in the efficiency test. Our project is open-sourced at https://github.com/CR400AF-A/SparseMM.
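To make the idea of score-based asymmetric KV-Cache budgets concrete, the sketch below is a minimal illustration, not the authors' implementation: the function names (allocate_head_budgets, prune_kv_cache), the uniform minimum-budget floor, and the use of accumulated attention weights as a per-position importance proxy are all assumptions for this example. It simply splits a total cache budget across heads in proportion to their visual scores, so the few visual heads retain many more cached positions than the rest.

```python
import torch

def allocate_head_budgets(visual_scores: torch.Tensor,
                          total_budget: int,
                          min_budget: int = 8) -> torch.Tensor:
    """Split a total KV-Cache budget across attention heads.

    visual_scores: (num_heads,) non-negative visual relevance per head.
    total_budget:  total number of cached positions to keep across heads.
    min_budget:    uniform floor so non-visual heads still keep some context.
    Returns an integer tensor (num_heads,) of per-head budgets.
    """
    num_heads = visual_scores.numel()
    # Reserve the uniform floor, then distribute what remains in
    # proportion to each head's visual score.
    remaining = max(total_budget - min_budget * num_heads, 0)
    weights = visual_scores.clamp(min=0)
    weights = weights / weights.sum().clamp(min=1e-6)
    return min_budget + (weights * remaining).floor().long()


def prune_kv_cache(keys: torch.Tensor,
                   values: torch.Tensor,
                   attn_weights: torch.Tensor,
                   budgets: torch.Tensor):
    """Keep, per head, only the budgets[h] most-attended cached positions.

    keys / values: (num_heads, seq_len, head_dim)
    attn_weights:  (num_heads, seq_len) accumulated attention received by
                   each cached position (a common importance proxy).
    Returns ragged lists of per-head pruned key/value tensors.
    """
    kept_keys, kept_values = [], []
    for h in range(keys.size(0)):
        k = min(int(budgets[h].item()), keys.size(1))
        top = attn_weights[h].topk(k).indices.sort().values
        kept_keys.append(keys[h, top])
        kept_values.append(values[h, top])
    return kept_keys, kept_values


if __name__ == "__main__":
    torch.manual_seed(0)
    num_heads = 32
    # Hypothetical visual scores: a couple of heads dominate, mirroring
    # the reported sparsity (fewer than 5% of heads are visual heads).
    scores = torch.zeros(num_heads)
    scores[[3, 17]] = torch.tensor([0.9, 0.6])
    budgets = allocate_head_budgets(scores, total_budget=4096)
    print(budgets)  # visual heads get large budgets, others only the floor
```

The floor keeps every head able to attend to recent context, while the bulk of the budget flows to the handful of visual heads, which is the asymmetry the abstract attributes to SparseMM.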