Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing

Abstract

Multi-modal large language models (MLLMs) have advanced general-purpose video understanding but struggle with long, high-resolution videos: they process every pixel equally in their vision transformers (ViTs) or LLMs despite significant spatiotemporal redundancy. We introduce AutoGaze, a lightweight module that removes redundant patches before they are processed by a ViT or an MLLM. Trained with next-token prediction and reinforcement learning, AutoGaze autoregressively selects a minimal set of multi-scale patches that can reconstruct the video within a user-specified error threshold, eliminating redundancy while preserving information. Empirically, AutoGaze reduces the number of visual tokens by 4x-100x and accelerates ViTs and MLLMs by up to 19x, which enables scaling MLLMs to 1K-frame, 4K-resolution videos and achieves superior results on video benchmarks (e.g., 67.0% on VideoMME). Furthermore, we introduce HLVid, the first high-resolution, long-form video QA benchmark with 5-minute 4K-resolution videos, where an MLLM scaled with AutoGaze improves over the baseline by 10.1% and outperforms the previous best MLLM by 4.5%. Project page: https://autogaze.github.io/
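To make the selection criterion concrete, the following is a minimal sketch of choosing the fewest patches that reconstruct a frame within an error threshold. It is a toy stand-in, not AutoGaze itself: it assumes a single patch scale, grayscale frames, a reconstruction that copies the previous frame and overwrites the selected patches, and a sort-based rule in place of the learned autoregressive policy. The names `select_patches`, `patch`, and `max_error` are hypothetical.

```python
import numpy as np


def select_patches(prev_frame, frame, patch=4, max_error=1e-3):
    """Pick the fewest patches of `frame` needed on top of `prev_frame`.

    Toy assumption: the reconstruction is `prev_frame` with the selected
    patches overwritten by the corresponding patches of `frame`. Patches
    are skipped in order of increasing error for as long as the mean
    squared reconstruction error stays at or below `max_error`.
    """
    h, w = frame.shape
    if h % patch or w % patch:
        raise ValueError("frame size must be a multiple of the patch size")
    gh, gw = h // patch, w // patch

    # Sum of squared differences inside each patch of the grid.
    diff = (frame.astype(np.float64) - prev_frame.astype(np.float64)) ** 2
    per_patch = diff.reshape(gh, patch, gw, patch).sum(axis=(1, 3)).ravel()

    # residual[k] is the mean squared error left over if the k + 1
    # lowest-error patches are skipped (copied from the previous frame).
    order = np.argsort(per_patch)
    residual = np.cumsum(per_patch[order]) / (h * w)

    # Skip as many low-error patches as the threshold allows; keep the rest.
    n_skip = int(np.searchsorted(residual, max_error, side="right"))
    chosen = order[n_skip:]
    return sorted((int(i) // gw, int(i) % gw) for i in chosen)


# Usage: only the top-right 4x4 patch changes between the two frames.
prev = np.zeros((8, 8))
frame = prev.copy()
frame[0:4, 4:8] = 1.0
picked = select_patches(prev, frame, patch=4, max_error=0.01)
print(picked, "of", (8 // 4) * (8 // 4), "patches kept")
```

In this toy setting, skipping the lowest-error patches first yields the smallest possible selection for a given threshold. The abstract describes a learned model that makes the selection autoregressively and across multiple scales, which this sketch does not attempt to reproduce.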
