Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models

Abstract

Video Large Language Models (VLLMs) demonstrate strong video understanding but suffer from inefficiency due to redundant visual tokens. Existing pruning methods primarily target intra-frame spatial redundancy or prune inside the LLM with shallow-layer overhead, yielding suboptimal spatiotemporal reduction and underutilizing long-context compressibility. Moreover, they often discard subtle yet informative context from merged or pruned tokens. In this paper, we propose a new perspective that refines token Anchors within and across frames to comprehensively aggregate informative context via local-global Optimal Transport (AOT). Specifically, we first establish local- and global-aware token anchors within each frame under attention guidance, then use optimal transport to aggregate informative context from the pruned tokens into them, constructing intra-frame token anchors. Next, building on temporal frame clips, the first frame of each clip serves as the keyframe anchor, into which similar information from consecutive frames is aggregated through optimal transport, while distinct tokens are kept to represent temporal dynamics. This yields efficient token reduction in a training-free manner. Extensive evaluations show that AOT achieves competitive performance across various short- and long-video benchmarks on leading video LLMs, delivering substantial gains in computational efficiency while preserving temporal and visual fidelity. Project webpage: https://tyroneli.github.io/AOT
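To make the intra-frame step more concrete, the following is a minimal sketch of how pruned tokens might be folded into attention-selected anchors with entropic optimal transport (Sinkhorn iterations). It is not the authors' implementation: the abstract does not specify the cost function, the marginals, or the update rule, so the cosine cost, the uniform marginals, the blending weight `alpha`, and the function names `sinkhorn` and `aggregate_into_anchors` are all assumptions made for illustration.

```python
import math

def cosine_cost(a, b):
    """Transport cost between two token vectors: 1 - cosine similarity (assumed)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb + 1e-8)

def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropic OT plan with uniform marginals over rows and columns (assumed)."""
    n, m = len(cost), len(cost[0])
    r, c = 1.0 / n, 1.0 / m
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the target marginals.
        u = [r / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [c / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def aggregate_into_anchors(tokens, scores, k, alpha=0.5):
    """Keep the k highest-scoring tokens as anchors and blend in pruned context.

    tokens: list of equal-length float vectors for one frame.
    scores: one attention-derived importance score per token (hypothetical input).
    alpha:  hypothetical weight of the transported context in each anchor.
    """
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    anchor_idx = sorted(order[:k])   # preserve original token order
    pruned_idx = sorted(order[k:])
    anchors = [tokens[i] for i in anchor_idx]
    pruned = [tokens[i] for i in pruned_idx]
    if not pruned:
        return anchors
    cost = [[cosine_cost(p, a) for a in anchors] for p in pruned]
    plan = sinkhorn(cost)  # plan[i][j]: mass moved from pruned i to anchor j
    dim = len(anchors[0])
    updated = []
    for j, anchor in enumerate(anchors):
        mass = sum(plan[i][j] for i in range(len(pruned)))
        # Transport-weighted average of the pruned tokens assigned to this anchor.
        context = [
            sum(plan[i][j] * pruned[i][d] for i in range(len(pruned))) / mass
            for d in range(dim)
        ]
        updated.append(
            [(1 - alpha) * anchor[d] + alpha * context[d] for d in range(dim)]
        )
    return updated
```

Under the same assumptions, the inter-frame step could reuse `aggregate_into_anchors` by treating the tokens of each clip's first frame as anchors and the tokens of the following frames as candidates for aggregation. How tokens are judged distinct enough to be kept rather than merged is not described in the abstract and is left out of this sketch.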
