VideoNSA: Native Sparse Attention Scales Video Understanding

Abstract

Video understanding in multimodal language models remains limited by context length: models often miss key transition frames and struggle to maintain coherence over long time scales. To address this, we adapt Native Sparse Attention (NSA) to video-language models. Our method, VideoNSA, adapts Qwen2.5-VL through end-to-end training on a 216K video instruction dataset. We take a hardware-aware hybrid approach to attention, preserving dense attention for text while applying NSA to video. Compared with token-compression and training-free sparse baselines, VideoNSA achieves improved performance on long-video understanding, temporal reasoning, and spatial benchmarks. Further ablation analysis reveals four key findings: (1) reliable scaling to 128K tokens; (2) an optimal global-local attention allocation at a fixed budget; (3) task-dependent branch usage patterns; and (4) the learnable combined sparse attention helps induce dynamic attention sinks.
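To make the hybrid idea concrete, the following is a minimal single-query sketch in NumPy, not the VideoNSA implementation. It assumes the three-branch form of NSA (block compression, block selection, sliding window) combined by gate weights. The function names, the block size, `top_k`, the window length, the use of mean pooling as a stand-in for a learned compressor, and the fixed gate values are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax_attend(q, K, V):
    """Dense scaled dot-product attention for one query vector."""
    scores = K @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

def sparse_video_attention(q, K, V, gates, block=4, top_k=2, window=4):
    """Gate-weighted sum of three sparse branches for one video query.

    Assumes len(K) is a multiple of `block`. Mean pooling is a stand-in
    for a learned block compressor, and `gates` are fixed numbers here
    rather than values predicted by the model.
    """
    n, d = K.shape
    Kb = K.reshape(n // block, block, d)
    Vb = V.reshape(n // block, block, d)

    # Compression branch: attend over one summary vector per block.
    K_cmp, V_cmp = Kb.mean(axis=1), Vb.mean(axis=1)
    out_cmp = softmax_attend(q, K_cmp, V_cmp)

    # Selection branch: keep all tokens of the top_k highest-scoring blocks.
    block_scores = K_cmp @ q
    chosen = np.sort(np.argsort(block_scores)[-top_k:])
    out_sel = softmax_attend(q, Kb[chosen].reshape(-1, d), Vb[chosen].reshape(-1, d))

    # Sliding-window branch: only the most recent `window` tokens.
    out_win = softmax_attend(q, K[-window:], V[-window:])

    g_cmp, g_sel, g_win = gates
    return g_cmp * out_cmp + g_sel * out_sel + g_win * out_win

def hybrid_attention(q, K, V, is_video_query, gates=(1 / 3, 1 / 3, 1 / 3)):
    """Dense attention for text queries, sparse branches for video queries."""
    if is_video_query:
        return sparse_video_attention(q, K, V, gates)
    return softmax_attend(q, K, V)
```

In this toy version, the compression branch plays the global role and the sliding window the local one, so shifting `block`, `top_k`, and `window` at a fixed token budget is one way to picture the global-local allocation the abstract refers to.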
