MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention

Abstract

The integration of long-context capabilities with visual understanding unlocks unprecedented potential for Vision Language Models (VLMs). However, the quadratic attention complexity during the pre-filling stage remains a significant obstacle to real-world deployment. To overcome this limitation, we introduce MMInference (Multimodality Million tokens Inference), a dynamic sparse attention method that accelerates the pre-filling stage for long-context multi-modal inputs. First, our analysis reveals that the temporal and spatial locality of video input leads to a unique sparse pattern, the Grid pattern. At the same time, VLMs exhibit markedly different sparse distributions across modalities. We introduce a permutation-based method that leverages the Grid pattern and handles modality boundary issues. By searching offline for the optimal sparse pattern of each head, MMInference constructs the sparse distribution dynamically based on the input. We also provide optimized GPU kernels for efficient sparse computation. Notably, MMInference integrates seamlessly into existing VLM pipelines without any model modifications or fine-tuning. Experiments on multi-modal benchmarks, including Video QA, Captioning, VisionNIAH, and Mixed-Modality NIAH, with state-of-the-art long-context VLMs (LongVila, LlavaVideo, VideoChat-Flash, Qwen2.5-VL) show that MMInference accelerates the pre-filling stage by up to 8.3x at 1M tokens while maintaining accuracy. Our code is available at https://aka.ms/MMInference.
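To give some intuition for why a permutation can help with a grid-like pattern, here is a minimal toy sketch in plain Python. It is not the authors' kernel or the MMInference API. The function names, the stride-based mask (a token attends only to earlier tokens at the same position within a frame), and the sizes are all assumptions made for illustration. The sketch reorders tokens so that strided positions become contiguous, then counts how many block tiles a block-sparse kernel would have to visit before and after the reordering.

```python
def grid_mask(n, stride):
    """Toy causal mask: token i attends to token j iff j <= i and both
    share the same position within a frame (i % stride == j % stride)."""
    return [[1 if j <= i and (i - j) % stride == 0 else 0 for j in range(n)]
            for i in range(n)]


def grid_permutation(n, stride):
    """Order tokens by position within a frame first, then by frame index,
    so tokens that attend to each other become neighbours."""
    return sorted(range(n), key=lambda t: (t % stride, t // stride))


def permute(mask, perm):
    """Apply the same permutation to rows (queries) and columns (keys)."""
    return [[mask[r][c] for c in perm] for r in perm]


def active_blocks(mask, block):
    """Count block x block tiles that contain at least one nonzero entry."""
    n = len(mask)
    count = 0
    for r0 in range(0, n, block):
        for c0 in range(0, n, block):
            rows = range(r0, min(r0 + block, n))
            cols = range(c0, min(c0 + block, n))
            if any(mask[r][c] for r in rows for c in cols):
                count += 1
    return count


# Hypothetical sizes: 4 frames of 4 tokens each, 4x4 tiles.
n, stride, block = 16, 4, 4
mask = grid_mask(n, stride)
perm = grid_permutation(n, stride)
permuted = permute(mask, perm)

before = active_blocks(mask, block)
after = active_blocks(permuted, block)
print(before, after)
```

With these toy sizes the number of nonzero entries is unchanged, but the tiles that contain any of them drop from 10 to 4, because the strided entries collapse into the diagonal blocks. A real implementation would also have to undo the permutation on the attention output and deal with boundaries between modalities, which this sketch does not attempt.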
