

CoreMatching: A Co-adaptive Sparse Inference Framework with Token and Neuron Pruning for Comprehensive Acceleration of Vision-Language Models

May 25, 2025
作者: Qinsi Wang, Hancheng Ye, Ming-Yu Chung, Yudong Liu, Yueqian Lin, Martin Kuo, Mingyuan Ma, Jianyi Zhang, Yiran Chen
cs.AI

Abstract

Vision-Language Models (VLMs) excel across diverse tasks but suffer from high inference costs in time and memory. Token sparsity mitigates inefficiencies in token usage, while neuron sparsity reduces high-dimensional computations; both offer promising solutions for enhancing efficiency. Recently, these two sparsity paradigms have evolved largely in parallel, fostering the prevailing assumption that they function independently. However, a fundamental yet underexplored question remains: do they truly operate in isolation, or is there a deeper underlying interplay that has yet to be uncovered? In this paper, we conduct the first comprehensive investigation into this question. By introducing and analyzing the matching mechanism between Core Neurons and Core Tokens, we find that the neurons and tokens key to inference mutually influence and reinforce each other. Building on this insight, we propose CoreMatching, a co-adaptive sparse inference framework that leverages the synergy between token and neuron sparsity to enhance inference efficiency. Through theoretical analysis and efficiency evaluations, we demonstrate that the proposed method surpasses state-of-the-art baselines on ten image understanding tasks and three hardware devices. Notably, on the NVIDIA Titan Xp, it achieves a 5x FLOPs reduction and a 10x overall speedup. Code is released at https://github.com/wangqinsi1/2025-ICML-CoreMatching/tree/main.
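To make the co-adaptive idea concrete, here is a minimal, hypothetical sketch (not the paper's actual algorithm) of how token and neuron pruning could interact: tokens are ranked by how much attention they receive, and neuron importance is then scored only on the surviving "core" tokens. The score functions, keep ratios, and names below are illustrative assumptions, not definitions from CoreMatching.

```python
import numpy as np

def core_tokens(attn_scores, keep_ratio=0.3):
    """Keep the tokens that receive the most total attention.
    attn_scores: (num_tokens, num_tokens) attention matrix (assumed criterion)."""
    importance = attn_scores.sum(axis=0)           # attention each token receives
    k = max(1, int(len(importance) * keep_ratio))
    return np.sort(np.argsort(importance)[-k:])    # indices of core tokens

def core_neurons(hidden, keep_ratio=0.3):
    """Keep FFN neurons with the largest mean activation magnitude.
    hidden: (num_tokens, num_neurons) activations (assumed criterion)."""
    importance = np.abs(hidden).mean(axis=0)
    k = max(1, int(hidden.shape[1] * keep_ratio))
    return np.sort(np.argsort(importance)[-k:])

# Toy example of the co-adaptation: neuron scores are computed
# only on the core tokens, so the two sparsity choices interact.
rng = np.random.default_rng(0)
attn = rng.random((16, 16))                 # 16 tokens
hid = rng.standard_normal((16, 64))         # 64 FFN neurons
tok_idx = core_tokens(attn)
neu_idx = core_neurons(hid[tok_idx])
```

In a real VLM, the pruned token set would shrink the KV cache and attention cost, while the pruned neuron set would shrink the FFN matrix multiplications; the sketch above only illustrates the selection step.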

