

VideoNSA: Native Sparse Attention Scales Video Understanding

October 2, 2025
作者: Enxin Song, Wenhao Chai, Shusheng Yang, Ethan Armand, Xiaojun Shan, Haiyang Xu, Jianwen Xie, Zhuowen Tu
cs.AI

Abstract

Video understanding in multimodal language models remains limited by context length: models often miss key transition frames and struggle to maintain coherence across long time scales. To address this, we adapt Native Sparse Attention (NSA) to video-language models. Our method, VideoNSA, adapts Qwen2.5-VL through end-to-end training on a 216K video instruction dataset. We employ a hardware-aware hybrid attention approach, preserving dense attention for text while applying NSA to video. Compared to token-compression and training-free sparse baselines, VideoNSA achieves improved performance on long-video understanding, temporal reasoning, and spatial benchmarks. Further ablation analysis reveals four key findings: (1) reliable scaling to 128K tokens; (2) an optimal global-local attention allocation at a fixed budget; (3) task-dependent branch usage patterns; and (4) learnable combined sparse attention helps induce dynamic attention sinks.
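
To make the hybrid design concrete, the sketch below routes text-token queries through dense attention and video-token queries through a toy block-sparse selection in the spirit of NSA. The function names, the mean-key block-scoring heuristic, and the block size / top-k values are illustrative assumptions for exposition only; VideoNSA itself trains NSA's compression, selection, and sliding-window branches end to end inside Qwen2.5-VL rather than using this simplified rule.

```python
# Minimal sketch of the hybrid-attention idea: dense attention for text tokens,
# block-sparse (NSA-style) attention for video tokens. All names, the block
# selection heuristic, and the hyperparameters are illustrative assumptions,
# not the VideoNSA implementation.
import torch
import torch.nn.functional as F


def dense_attention(q, k, v):
    # Standard scaled dot-product attention (text branch).
    return F.scaled_dot_product_attention(q, k, v)


def block_sparse_attention(q, k, v, block_size=64, top_k=2):
    # Toy selection rule: summarize each key block by its mean key, score the
    # blocks against each query, and let each query attend only to its top-k
    # blocks. (NSA proper also uses compressed and sliding-window branches.)
    B, H, Nq, D = q.shape
    N = k.shape[2]
    assert N >= block_size, "sequence shorter than one block"
    n_blocks = N // block_size
    block_keys = k[:, :, : n_blocks * block_size].reshape(
        B, H, n_blocks, block_size, D
    ).mean(dim=3)                                             # (B, H, n_blocks, D)
    scores = torch.einsum("bhqd,bhnd->bhqn", q, block_keys)   # query-to-block scores
    top = scores.topk(min(top_k, n_blocks), dim=-1).indices   # (B, H, Nq, k)

    # Expand selected block indices to token indices and build a boolean mask
    # (True = this query may attend to this key position).
    offsets = torch.arange(block_size, device=q.device)
    cols = (top.unsqueeze(-1) * block_size + offsets).reshape(B, H, Nq, -1)
    mask = torch.zeros(B, H, Nq, N, dtype=torch.bool, device=q.device)
    mask.scatter_(-1, cols, torch.ones_like(cols, dtype=torch.bool))
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)


def hybrid_attention(q, k, v, is_video):
    # Route video-token queries through sparse attention and text-token queries
    # through dense attention; both branches see the full key/value sequence.
    out = torch.empty_like(q)
    out[:, :, ~is_video] = dense_attention(q[:, :, ~is_video], k, v)
    out[:, :, is_video] = block_sparse_attention(q[:, :, is_video], k, v)
    return out


# Example: 8 text tokens followed by 248 video tokens, 4 heads.
B, H, N, D = 1, 4, 256, 64
q, k, v = (torch.randn(B, H, N, D) for _ in range(3))
is_video = torch.arange(N) >= 8
print(hybrid_attention(q, k, v, is_video).shape)  # torch.Size([1, 4, 256, 64])
```

In this toy version only the query-side sparsity differs between branches: text queries keep exact dense attention, while each video query's cost is capped by the number of selected blocks, which is the intuition behind mixing dense text attention with sparse video attention.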