VisionTrim: Unified Vision Token Compression for Training-Free MLLM Acceleration

Abstract

Multimodal large language models (MLLMs) incur high computational costs because of excessive visual tokens, particularly in high-resolution and video-based scenarios. Existing token reduction methods typically focus on isolated pipeline components and often neglect textual alignment, leading to performance degradation. In this paper, we propose VisionTrim, a unified framework for training-free MLLM acceleration that integrates two effective plug-and-play modules: 1) the Dominant Vision Token Selection (DVTS) module, which preserves essential visual tokens via a global-local view, and 2) the Text-Guided Vision Complement (TGVC) module, which facilitates context-aware token merging guided by textual cues. Extensive experiments across diverse image and video multimodal benchmarks demonstrate the superior performance of VisionTrim, advancing practical MLLM deployment in real-world applications. The code is available at https://github.com/hanxunyu/VisionTrim.
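
To make the select-then-merge idea concrete, here is a minimal pure-Python sketch of a generic token compression step. It is not the VisionTrim implementation: the abstract does not specify how DVTS scores tokens or how TGVC merges them, so the per-token importance scores, the cosine-similarity weighting, and every function name below (`select_dominant`, `text_guided_merge`, `compress`) are assumptions made purely for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    # Cosine similarity; returns 0.0 if either vector has zero norm.
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0


def select_dominant(scores, keep):
    """Indices of the `keep` highest-scoring tokens, in original order.

    ASSUMPTION: `scores` are importance scores supplied by the caller;
    how the paper derives them (its global-local view) is not shown here.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


def text_guided_merge(tokens, kept_idx, text_vec):
    """Fold each dropped token into its most similar kept token.

    ASSUMPTION: a dropped token is weighted by its cosine similarity to a
    single text vector (clamped at 0), so text-irrelevant tokens are ignored.
    """
    kept = set(kept_idx)
    sums = {i: list(tokens[i]) for i in kept_idx}
    weights = {i: 1.0 for i in kept_idx}
    for j, tok in enumerate(tokens):
        if j in kept:
            continue
        w = max(cosine(tok, text_vec), 0.0)
        if w == 0.0:
            continue
        target = max(kept_idx, key=lambda i: cosine(tok, tokens[i]))
        sums[target] = [s + w * t for s, t in zip(sums[target], tok)]
        weights[target] += w
    # Weighted average of each kept token and the tokens merged into it.
    return [[s / weights[i] for s in sums[i]] for i in kept_idx]


def compress(tokens, scores, text_vec, keep):
    idx = select_dominant(scores, keep)
    return text_guided_merge(tokens, idx, text_vec)


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.0, -1.0]]
    scores = [0.9, 0.8, 0.1, 0.2]
    print(compress(tokens, scores, text_vec=[1.0, 0.0], keep=2))
```

In this toy run, four visual tokens are reduced to two: the third token is absorbed into the first because it is both similar to it and aligned with the text vector, while the fourth is discarded because it has no positive similarity to the text. A real system would operate on batched tensors and take its scores from the model itself; the sketch only shows the control flow.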
