SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs
June 5, 2025
作者: Jiahui Wang, Zuyan Liu, Yongming Rao, Jiwen Lu
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) are commonly derived by extending
pre-trained Large Language Models (LLMs) with visual capabilities. In this
work, we investigate how MLLMs process visual inputs by analyzing their
attention mechanisms. We reveal a surprising sparsity phenomenon: only a small
subset (less than 5%) of attention heads in LLMs actively
contribute to visual understanding, termed visual heads. To identify these
heads efficiently, we design a training-free framework that quantifies
head-level visual relevance through targeted response analysis. Building on
this discovery, we introduce SparseMM, a KV-Cache optimization strategy that
allocates asymmetric computation budgets to heads in LLMs based on their visual
scores, leveraging the sparsity of visual heads to accelerate the inference
of MLLMs. Compared with prior KV-Cache acceleration methods that ignore the
particularity of visual information, SparseMM prioritizes and retains visual
semantics during decoding. Extensive evaluations across mainstream multimodal
benchmarks demonstrate that SparseMM achieves superior accuracy-efficiency
trade-offs. Notably, SparseMM delivers 1.38x real-time acceleration and 52%
memory reduction during generation while maintaining performance parity in the
efficiency test. Our project is open-sourced at
https://github.com/CR400AF-A/SparseMM.
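
To make the asymmetric-budget idea concrete, below is a minimal illustrative sketch, not the authors' implementation: the function name, the per-head score values, and the minimum per-head floor are hypothetical. It splits a total KV-Cache budget across attention heads in proportion to their visual scores, so heads identified as visual heads retain more cached entries during decoding while the remaining heads keep only a small floor.

# Minimal sketch (assumptions noted above), in Python.
from typing import Dict

def allocate_kv_budgets(
    visual_scores: Dict[int, float],   # hypothetical head_id -> visual relevance score
    total_budget: int,                 # total KV-Cache entries available across heads
    min_budget: int = 16,              # hypothetical floor so every head keeps some context
) -> Dict[int, int]:
    """Split a total KV-Cache budget across heads in proportion to visual scores."""
    score_sum = sum(visual_scores.values()) or 1.0
    remaining = total_budget - min_budget * len(visual_scores)
    budgets = {}
    for head_id, score in visual_scores.items():
        # Heads with higher visual scores receive proportionally more cache slots.
        budgets[head_id] = min_budget + int(remaining * score / score_sum)
    return budgets

# Example: head 3 behaves like a visual head and receives most of the budget.
print(allocate_kv_budgets({0: 0.01, 1: 0.02, 2: 0.01, 3: 0.96}, total_budget=1024))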