VideoNSA: Native Sparse Attention Scales Video Understanding

Abstract

Video understanding in multimodal language models remains limited by context length: models often miss key transition frames and struggle to maintain coherence across long time scales. To address this, we adapt Native Sparse Attention (NSA) to video-language models. Our method, VideoNSA, adapts Qwen2.5-VL through end-to-end training on a 216K video instruction dataset. We employ a hardware-aware hybrid approach to attention, preserving dense attention for text while applying NSA to video. Compared to token-compression and training-free sparse baselines, VideoNSA achieves improved performance on long-video understanding, temporal reasoning, and spatial benchmarks. Further ablation analysis reveals four key findings: (1) reliable scaling to 128K tokens; (2) an optimal global-local attention allocation at a fixed budget; (3) task-dependent branch usage patterns; and (4) the learnable combined sparse attention helps induce dynamic attention sinks.
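The hybrid scheme mentioned in the abstract can be illustrated with a toy, single-query sketch in plain Python. It is not the VideoNSA implementation. It assumes the three-branch layout commonly associated with NSA (a compressed branch, a top-k block-selection branch, and a sliding-window branch, mixed by gates), which the abstract does not spell out. The function names, the block and window sizes, the fixed gate values, and routing by query modality are all hypothetical choices made for illustration; in a trained model the gates would be learned rather than passed in.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(q, keys, values):
    """Dense scaled dot-product attention of one query over keys/values."""
    scale = math.sqrt(len(q))
    weights = softmax([dot(q, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]


def sparse_attend(q, keys, values, gates, block=4, top_k=2, window=4):
    """Gated sum of three sparse branches (assumed NSA-style layout).

    `keys`/`values` are assumed to hold only positions the query may see.
    `gates` is a hypothetical fixed triple (compress, select, window).
    """
    blocks = [range(i, min(i + block, len(keys)))
              for i in range(0, len(keys), block)]

    # Compression branch: one mean-pooled key/value per block.
    ck = [mean_pool([keys[i] for i in b]) for b in blocks]
    cv = [mean_pool([values[i] for i in b]) for b in blocks]
    out_cmp = attend(q, ck, cv)

    # Selection branch: keep the top_k blocks ranked by compressed-key score.
    ranked = sorted(range(len(blocks)), key=lambda i: dot(q, ck[i]), reverse=True)
    idx = [i for b in sorted(ranked[:top_k]) for i in blocks[b]]
    out_sel = attend(q, [keys[i] for i in idx], [values[i] for i in idx])

    # Sliding-window branch: only the most recent `window` positions.
    out_win = attend(q, keys[-window:], values[-window:])

    g_cmp, g_sel, g_win = gates
    return [g_cmp * a + g_sel * b + g_win * c
            for a, b, c in zip(out_cmp, out_sel, out_win)]


def hybrid_attend(q, keys, values, is_video_query, gates):
    """Dense attention for text queries, sparse attention for video queries."""
    if is_video_query:
        return sparse_attend(q, keys, values, gates)
    return attend(q, keys, values)
```

In this sketch the sparse path touches at most `top_k * block + window` raw tokens plus one pooled token per block, so its cost grows far more slowly with context length than the dense path does. How a fixed budget is best split between the global (compressed, selected) and local (window) branches is the kind of allocation question the ablations in the abstract refer to.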
