MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention

Abstract

The integration of long-context capabilities with visual understanding unlocks unprecedented potential for Vision Language Models (VLMs). However, the quadratic attention complexity during the pre-filling phase remains a significant obstacle to real-world deployment. To overcome this limitation, we introduce MMInference (Multimodality Million tokens Inference), a dynamic sparse attention method that accelerates the pre-filling stage for long-context multi-modal inputs. First, our analysis reveals that the temporal and spatial locality of video input leads to a unique sparse pattern, the Grid pattern. At the same time, VLMs exhibit markedly different sparse distributions across modalities. We introduce a permutation-based method to exploit the Grid pattern and to handle modality boundary issues. MMInference searches offline for the optimal sparse pattern of each head and then constructs the sparse distribution dynamically based on the input. We also provide optimized GPU kernels for efficient sparse computation. Notably, MMInference integrates seamlessly into existing VLM pipelines without any model modifications or fine-tuning. Experiments on multi-modal benchmarks, including Video QA, Captioning, VisionNIAH, and Mixed-Modality NIAH, with state-of-the-art long-context VLMs (LongVila, LlavaVideo, VideoChat-Flash, Qwen2.5-VL) show that MMInference accelerates the pre-filling stage by up to 8.3x at 1M tokens while maintaining accuracy. Our code is available at https://aka.ms/MMInference.
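To make the idea of a permutation acting on a grid-shaped sparse pattern concrete, the following is a minimal sketch under strong simplifying assumptions, not the paper's kernel or search procedure. It assumes an idealized grid in which a video token attends only to tokens at the same spatial slot in other frames, and it ignores causal masking and text tokens. The function names (grid_mask, stride_permutation, permute_mask, is_block_diagonal) are hypothetical and chosen for illustration only. The sketch shows that reordering tokens by spatial slot turns the strided grid into contiguous dense blocks, which is the kind of layout that block-sparse GPU kernels can process efficiently.

```python
# Illustrative sketch only: an idealized grid pattern and a permutation
# that makes it block-diagonal. All names here are hypothetical.

def grid_mask(n_frames, tokens_per_frame):
    """Boolean attention mask for an idealized grid pattern.

    Token i may attend to token j iff both sit at the same spatial slot
    of their frame, i.e. i % tokens_per_frame == j % tokens_per_frame.
    Causal masking is deliberately ignored in this sketch.
    """
    n = n_frames * tokens_per_frame
    return [
        [(i % tokens_per_frame) == (j % tokens_per_frame) for j in range(n)]
        for i in range(n)
    ]


def stride_permutation(n_frames, tokens_per_frame):
    """Token order that groups tokens by spatial slot across frames.

    Result: slot 0 of every frame, then slot 1 of every frame, and so on.
    """
    return [
        frame * tokens_per_frame + slot
        for slot in range(tokens_per_frame)
        for frame in range(n_frames)
    ]


def permute_mask(mask, perm):
    """Apply the same permutation to the rows and columns of a mask."""
    return [[mask[i][j] for j in perm] for i in perm]


def is_block_diagonal(mask, block):
    """True iff the mask is exactly dense diagonal blocks of size `block`."""
    n = len(mask)
    return all(
        mask[i][j] == (i // block == j // block)
        for i in range(n)
        for j in range(n)
    )
```

For example, with 4 frames of 3 tokens each, grid_mask(4, 3) is a strided pattern, while permute_mask(grid_mask(4, 3), stride_permutation(4, 3)) consists of three dense 4x4 diagonal blocks. Because softmax attention is equivariant to a consistent reordering of queries, keys, and values, the outputs can be restored to the original order by applying the inverse permutation.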
