QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension

Abstract

Recent advances in long video understanding typically mitigate visual redundancy by pruning visual tokens according to their attention distribution. However, existing methods prune low-response tokens post hoc in the decoder layers and overlook the input-level semantic correlation between visual tokens and the instruction (the query). In this paper, we propose QuoTA, an ante-hoc, training-free module that extends existing large video-language models (LVLMs) with visual token assignment based on query-oriented, frame-level importance assessment. Query-oriented token selection is crucial because it aligns visual processing with task-specific requirements, making efficient use of the token budget while preserving semantically relevant content. Specifically, (i) QuoTA allocates frame-level importance scores based on query relevance, so that visual tokens are assigned once, before any cross-modal interaction in the decoder layers; (ii) we decouple the query through Chain-of-Thought (CoT) reasoning to enable more precise LVLM-based frame importance scoring; and (iii) QuoTA is plug-and-play and extends to existing LVLMs. Extensive experiments show that applying QuoTA to LLaVA-Video-7B yields an average performance improvement of 3.2% across six benchmarks (including Video-MME and MLVU) while operating within the same visual token budget as the baseline. Code is available at https://github.com/MAC-AutoML/QuoTA.
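
To make the idea of one-time, score-driven token assignment concrete, the following is a minimal sketch and not the authors' implementation. It assumes that per-frame query-relevance scores are already available as non-negative numbers, and it splits a fixed visual token budget across frames in proportion to those scores using largest-remainder rounding. The function name `allocate_tokens`, the proportional rule, and the rounding scheme are all illustrative assumptions; the abstract does not specify how scores are mapped to token counts.

```python
from typing import List, Sequence


def allocate_tokens(scores: Sequence[float], budget: int) -> List[int]:
    """Split `budget` tokens across frames in proportion to `scores`.

    Hypothetical helper for illustration only. Scores must be
    non-negative; if they are all zero, the budget is split evenly.
    """
    n = len(scores)
    if n == 0:
        return []
    if budget < 0 or any(s < 0 for s in scores):
        raise ValueError("budget and scores must be non-negative")

    total = sum(scores)
    # Fall back to a uniform split when no frame is scored as relevant.
    weights = [s / total for s in scores] if total > 0 else [1.0 / n] * n

    # Ideal (fractional) share per frame, then round down.
    ideal = [w * budget for w in weights]
    alloc = [int(x) for x in ideal]

    # Hand the tokens lost to rounding to the frames with the
    # largest fractional remainders, so the budget is met exactly.
    leftover = budget - sum(alloc)
    by_remainder = sorted(range(n), key=lambda i: ideal[i] - alloc[i], reverse=True)
    for i in by_remainder[:leftover]:
        alloc[i] += 1
    return alloc


if __name__ == "__main__":
    # Three frames; the second is judged most relevant to the query.
    print(allocate_tokens([1, 6, 3], budget=100))
```

Because the allocation always sums to the budget, a sketch like this keeps the total number of visual tokens equal to that of a uniform-sampling baseline while shifting tokens toward frames judged more relevant to the query.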

