HoliTom: Holistic Token Merging for Fast Video Large Language Models

Abstract

Video large language models (video LLMs) excel at video comprehension but are computationally inefficient because of redundant video tokens. Existing token pruning methods address this only in part. Approaches that operate within the LLM (inner-LLM pruning), such as FastV, still incur the full computational cost in the shallow layers before pruning takes effect. Methods that prune tokens before the LLM (outer-LLM pruning) primarily address spatial redundancy within individual frames or limited temporal windows, neglecting the global temporal dynamics and correlations across longer video sequences. The resulting spatio-temporal reduction is sub-optimal and does not fully exploit the compressibility of video. Moreover, the synergistic potential and mutual influence of combining the two strategies remain unexplored. To further reduce redundancy, we introduce HoliTom, a novel training-free holistic token merging framework. HoliTom performs outer-LLM pruning through global redundancy-aware temporal segmentation, followed by spatio-temporal merging, reducing visual tokens by over 90% and significantly alleviating the LLM's computational burden. Complementing this, we introduce a robust inner-LLM merging approach based on token similarity, designed for strong performance and compatibility with outer-LLM pruning. Evaluations on LLaVA-OneVision-7B show a promising efficiency-performance trade-off: computational cost is reduced to 6.9% of the original FLOPs while 99.1% of the original performance is maintained. Furthermore, we achieve a 2.28x reduction in Time-To-First-Token (TTFT) and a 1.32x acceleration in decoding throughput, highlighting the practical benefits of our integrated pruning approach for efficient video LLM inference.
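To make the idea of similarity-based temporal merging concrete, the following is a minimal sketch and not HoliTom's actual algorithm, which the abstract does not specify in detail. It assumes each frame is a list of token vectors on a fixed spatial grid, and it merges a token into the running average at its position whenever its cosine similarity to the first token of the current run is at least a threshold. The names merge_temporal and tau, and the threshold value, are hypothetical choices made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_temporal(frames, tau=0.9):
    """Merge temporally redundant tokens position by position.

    frames: list of frames; each frame is a list of token vectors, and all
            frames have the same number of spatial positions.
    tau:    hypothetical similarity threshold (an assumption, not from the paper).

    Returns a list of (start_frame, position, merged_vector) tuples, where
    merged_vector is the mean of a run of consecutive similar tokens.
    """
    n_pos = len(frames[0])
    anchors = [list(v) for v in frames[0]]  # first token of the current run
    sums = [list(v) for v in frames[0]]     # running sum of the current run
    counts = [1] * n_pos                    # run length per position
    starts = [0] * n_pos                    # frame index where the run began
    kept = []

    for t in range(1, len(frames)):
        for p in range(n_pos):
            tok = frames[t][p]
            if cosine(anchors[p], tok) >= tau:
                # Redundant over time: fold into the running average.
                sums[p] = [s + x for s, x in zip(sums[p], tok)]
                counts[p] += 1
            else:
                # Content changed: emit the finished run and start a new one.
                kept.append((starts[p], p, [s / counts[p] for s in sums[p]]))
                anchors[p], sums[p] = list(tok), list(tok)
                counts[p], starts[p] = 1, t

    # Emit the last open run at every position.
    for p in range(n_pos):
        kept.append((starts[p], p, [s / counts[p] for s in sums[p]]))

    kept.sort(key=lambda k: (k[0], k[1]))
    return kept
```

In this toy form, a position that never changes across the whole video contributes a single merged token, which is the kind of long-range temporal redundancy the abstract says per-frame or short-window methods miss. A real system would operate on tensors from the vision encoder and would choose segment boundaries globally rather than greedily.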
