MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention
April 22, 2025
Authors: Yucheng Li, Huiqiang Jiang, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Amir H. Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, Lili Qiu
cs.AI
Abstract
The integration of long-context capabilities with visual understanding
unlocks unprecedented potential for Vision Language Models (VLMs). However, the
quadratic attention complexity during the pre-filling phase remains a
significant obstacle to real-world deployment. To overcome this limitation, we
introduce MMInference (Multimodality Million tokens Inference), a dynamic
sparse attention method that accelerates the pre-filling stage for long-context
multi-modal inputs. First, our analysis reveals that the temporal and spatial
locality of video input leads to a unique sparse pattern, the Grid pattern.
Simultaneously, VLMs exhibit markedly different sparse distributions across
different modalities. We propose a permutation-based method to leverage the
unique Grid pattern and handle modality boundary issues. By searching offline for
the optimal sparse pattern for each head, MMInference constructs the sparse
distribution dynamically based on the input. We also provide optimized GPU
kernels for efficient sparse computations. Notably, MMInference integrates
seamlessly into existing VLM pipelines without any model modifications or
fine-tuning. Experiments on multi-modal benchmarks (including Video QA,
Captioning, VisionNIAH, and Mixed-Modality NIAH) with state-of-the-art
long-context VLMs (LongVila, LlavaVideo, VideoChat-Flash, Qwen2.5-VL) show that
MMInference accelerates the pre-filling stage by up to 8.3x at 1M tokens while
maintaining accuracy. Our code is available at https://aka.ms/MMInference.
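To make the permutation idea concrete, below is a minimal NumPy sketch, not the paper's implementation: it reorders video tokens so that tokens sharing a spatial position across frames become contiguous, then runs dense attention only within those contiguous blocks, standing in for the optimized block-sparse GPU kernels. The helper names (`grid_permute`, `sparse_grid_attention`), the frame-major token ordering, and the strict block-diagonal pattern are illustrative assumptions.

```python
# Minimal sketch of grid-pattern sparse attention via permutation.
# Assumptions (not from the paper): tokens are frame-major (frame 0's patches,
# then frame 1's, ...) and the head uses a block-diagonal pattern after
# permutation. Helper names are hypothetical.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grid_permute(n_tokens, patches_per_frame):
    """Reorder indices so tokens at the same spatial patch across all frames
    become contiguous, turning the strided Grid pattern into dense blocks."""
    idx = np.arange(n_tokens)
    return idx.reshape(-1, patches_per_frame).T.reshape(-1)

def sparse_grid_attention(q, k, v, patches_per_frame):
    """Dense attention within each permuted block only; a stand-in for an
    optimized block-sparse GPU kernel running on the permuted layout."""
    n, d = q.shape
    perm = grid_permute(n, patches_per_frame)
    inv = np.argsort(perm)                      # inverse permutation
    qp, kp, vp = q[perm], k[perm], v[perm]
    block = n // patches_per_frame              # tokens per block (= #frames)
    out = np.empty_like(qp)
    for b in range(patches_per_frame):
        s = slice(b * block, (b + 1) * block)
        scores = qp[s] @ kp[s].T / np.sqrt(d)
        out[s] = softmax(scores) @ vp[s]
    return out[inv]                             # restore original token order

# Toy usage: 8 frames x 4 patches = 32 video tokens, head dim 16.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((32, 16)) for _ in range(3))
print(sparse_grid_attention(q, k, v, patches_per_frame=4).shape)  # (32, 16)
```

In the actual method, the per-head pattern comes from the offline search and the sparse distribution is built dynamically per input; this sketch only illustrates why the permutation makes the Grid pattern amenable to dense block computation.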