Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration

Abstract

To accelerate the inference of heavy Multimodal Large Language Models (MLLMs), this study rethinks the current landscape of training-free token reduction research. We find that the critical components of existing methods are tightly intertwined, which leaves their interconnections and individual effects unclear and makes the methods hard to compare, transfer, and extend. Therefore, we propose a unified "filter-correlate-compress" paradigm that decomposes token reduction into three distinct stages within a pipeline, maintaining consistent design objectives and elements while allowing for unique implementations. We additionally demystify popular works and subsume them into our paradigm to showcase its universality. Finally, we offer a suite of methods grounded in the paradigm that balance speed and accuracy across different phases of inference. Experimental results across 10 benchmarks indicate that our methods achieve up to an 82.4% reduction in FLOPs with minimal impact on performance, while surpassing state-of-the-art training-free methods. Our project page is at https://ficoco-accelerate.github.io/.
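
The abstract names the three stages but does not say how each is implemented, so the following is only a minimal illustrative sketch of what a filter-correlate-compress pipeline could look like. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function name `filter_correlate_compress`, the externally supplied importance `scores`, the use of cosine similarity for the correlate stage, and mean-merging for the compress stage.

```python
import numpy as np


def filter_correlate_compress(tokens, scores, keep):
    """Hypothetical sketch of a three-stage token reduction step.

    tokens: (n, d) array of token features.
    scores: (n,) array of importance scores (assumed given; higher = more important).
    keep:   number of tokens retained after reduction (1 <= keep <= n).
    """
    tokens = np.asarray(tokens, dtype=float)
    scores = np.asarray(scores, dtype=float)
    n = tokens.shape[0]
    if not 1 <= keep <= n:
        raise ValueError("keep must be between 1 and the number of tokens")

    # Filter: decide which tokens to discard (here: the lowest-scoring ones).
    order = np.argsort(-scores, kind="stable")
    kept_idx = np.sort(order[:keep])   # preserve original token order
    drop_idx = np.sort(order[keep:])
    kept = tokens[kept_idx]
    if drop_idx.size == 0:
        return kept.copy()
    dropped = tokens[drop_idx]

    # Correlate: match each discarded token to its most similar kept token
    # (cosine similarity is an assumption of this sketch).
    def unit(x):
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)

    similarity = unit(dropped) @ unit(kept).T      # shape (n_drop, keep)
    target = similarity.argmax(axis=1)             # shape (n_drop,)

    # Compress: fold each discarded token into its matched kept token by averaging.
    sums = kept.copy()
    counts = np.ones(keep)
    np.add.at(sums, target, dropped)   # unbuffered add handles repeated targets
    np.add.at(counts, target, 1.0)
    return sums / counts[:, None]
```

As a small usage example, with `tokens = [[1, 0], [0, 1], [2, 0]]`, `scores = [3, 2, 1]`, and `keep = 2`, the third token is discarded, matched to the first token (same direction), and averaged into it, giving `[[1.5, 0], [0, 1]]`. Splitting the step into three separately replaceable stages mirrors the decomposition the abstract describes, although the specific choice made at each stage above is arbitrary.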