EarlyTom: Early Token Compression Enables Fast Video Understanding

Abstract

Video large language models (Video-LLMs) have demonstrated strong capabilities in video understanding tasks. However, their practical deployment is still hindered by the cost of processing massive numbers of visual tokens. Although recent approaches achieve extremely low token retention ratios while maintaining accuracy comparable to full-token baselines, most of them perform compression only at the late stage of prefilling, leaving the efficiency of the vision encoder unoptimized. In this paper, we first show that vision encoding accounts for a large portion of the time-to-first-token (TTFT). Compressing visual tokens inside the vision encoder, rather than only after it, therefore remains largely unexplored. Based on this insight, we propose EarlyTom, a training-free token compression framework that performs early-stage visual token compression inside the vision encoder, yielding substantially larger TTFT reductions and higher throughput. In addition, we introduce a decoupled spatial token selection strategy that improves overall compression effectiveness. EarlyTom reduces TTFT by up to 2.65x and FLOPs by up to 61% on a single NVIDIA A100 GPU for the LLaVA-OneVision-7B model, while maintaining accuracy comparable to the full-token baseline. These improvements substantially enhance the practicality of deploying Video-LLMs in real-world production scenarios.