CoreMatching: A Co-adaptive Sparse Inference Framework with Token and Neuron Pruning for Comprehensive Acceleration of Vision-Language Models
May 25, 2025
Authors: Qinsi Wang, Hancheng Ye, Ming-Yu Chung, Yudong Liu, Yueqian Lin, Martin Kuo, Mingyuan Ma, Jianyi Zhang, Yiran Chen
cs.AI
Abstract
Vision-Language Models (VLMs) excel across diverse tasks but suffer from high
inference costs in time and memory. Token sparsity mitigates inefficiencies in
token usage, while neuron sparsity reduces high-dimensional computations, both
offering promising solutions to enhance efficiency. Recently, these two
sparsity paradigms have evolved largely in parallel, fostering the prevailing
assumption that they function independently. However, a fundamental yet
underexplored question remains: Do they truly operate in isolation, or is there
a deeper underlying interplay that has yet to be uncovered? In this paper, we
conduct the first comprehensive investigation into this question. By
introducing and analyzing the matching mechanism between Core Neurons and Core
Tokens, we find that key neurons and tokens for inference mutually influence
and reinforce each other. Building on this insight, we propose CoreMatching, a
co-adaptive sparse inference framework, which leverages the synergy between
token and neuron sparsity to enhance inference efficiency. Through theoretical
analysis and efficiency evaluations, we demonstrate that the proposed method
surpasses state-of-the-art baselines on ten image understanding tasks and three
hardware devices. Notably, on the NVIDIA Titan Xp, it achieves a 5x FLOPs
reduction and a 10x overall speedup. Code is released at
https://github.com/wangqinsi1/2025-ICML-CoreMatching/tree/main.
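The co-adaptive idea the abstract describes — selecting the neurons and tokens that matter for inference so that each selection informs the other — can be illustrated with a toy score-based pruning sketch. This is a minimal illustration under assumed scoring rules (mean activation magnitude for neurons, activation mass on the selected neurons for tokens); the names `core_match`, `neuron_k`, and `token_k` are hypothetical and this is not the paper's actual CoreMatching algorithm:

```python
# Toy sketch of co-adaptive token/neuron selection (illustrative only;
# not the authors' exact CoreMatching procedure).

def top_k_indices(scores, k):
    """Indices of the k largest scores, in descending score order."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def core_match(activations, neuron_k, token_k):
    """Pick 'core neurons' by average |activation| across tokens, then pick
    'core tokens' by their total |activation| restricted to those neurons.

    activations: list of per-token activation vectors (tokens x neurons).
    Returns (core_neuron_indices, core_token_indices).
    """
    n_neurons = len(activations[0])
    # Core neurons: largest mean absolute activation over all tokens.
    neuron_scores = [
        sum(abs(tok[j]) for tok in activations) / len(activations)
        for j in range(n_neurons)
    ]
    core_neurons = top_k_indices(neuron_scores, neuron_k)
    # Core tokens: largest summed activation mass on the core neurons,
    # so the token choice depends on the neuron choice.
    token_scores = [sum(abs(tok[j]) for j in core_neurons) for tok in activations]
    core_tokens = top_k_indices(token_scores, token_k)
    return core_neurons, core_tokens
```

Computation on the pruned model would then be restricted to the returned token and neuron subsets, which is where the FLOPs reduction the abstract reports would come from.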