Unleashing Hour-Scale Video Training for Long Video-Language Understanding
June 5, 2025
Authors: Jingyang Lin, Jialian Wu, Ximeng Sun, Ze Wang, Jiang Liu, Yusheng Su, Xiaodong Yu, Hao Chen, Jiebo Luo, Zicheng Liu, Emad Barsoum
cs.AI
Abstract
Recent long-form video-language understanding benchmarks have driven progress
in video large multimodal models (Video-LMMs). However, the scarcity of
well-annotated long videos has left the training of Video-LMMs on hour-long videos
underexplored. To close this gap, we present VideoMarathon, a large-scale
hour-long video instruction-following dataset. This dataset includes around
9,700 hours of videos, each 3 to 60 minutes long, sourced from diverse
domains. Specifically, it contains 3.3M high-quality QA pairs,
spanning six fundamental topics: temporality, spatiality, object, action,
scene, and event. Compared to existing video instruction datasets,
VideoMarathon significantly extends training video durations up to 1 hour, and
supports 22 diverse tasks requiring both short- and long-term video
comprehension. Building on VideoMarathon, we propose Hour-LLaVA, a powerful and
efficient Video-LMM for hour-scale video-language modeling. It enables
training and inference on hour-long videos at 1-FPS sampling by leveraging a
memory augmentation module, which adaptively integrates question-relevant and
spatiotemporally informative semantics from a cached full-video context. In our
experiments, Hour-LLaVA achieves the best performance on multiple long
video-language benchmarks, demonstrating the high quality of the VideoMarathon
dataset and the superiority of the Hour-LLaVA model.
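
To make the dataset format concrete, here is a minimal sketch of what a VideoMarathon-style instruction sample could look like. The abstract only specifies long videos (3 to 60 minutes), QA pairs, and six topic labels; every field name and value below is a hypothetical illustration, not the released schema.

```python
# A hypothetical VideoMarathon-style QA sample. The abstract specifies
# 3-60 minute videos, QA pairs, and six topics (temporality, spatiality,
# object, action, scene, event); all field names and values here are
# illustrative assumptions, not the dataset's actual schema.
sample = {
    "video_id": "example_0001",   # hypothetical identifier
    "duration_seconds": 2520,     # a 42-minute video, within the 3-60 min range
    "topic": "temporality",       # one of the six topics listed in the abstract
    "question": "What happens right after the speaker finishes the demo?",
    "answer": "The audience asks questions during a short Q&A session.",
}
print(sample["topic"])  # -> temporality
```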
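The abstract describes the memory augmentation module only at a high level: it adaptively retrieves question-relevant, spatiotemporally informative semantics from a cached full-video context. The sketch below shows one plausible reading of that idea as query-conditioned cross-attention over a 1-FPS memory bank; the class name, dimensions, and single-layer design are assumptions for illustration, not the paper's actual Hour-LLaVA architecture.

```python
# A minimal sketch (assumed design) of a memory augmentation step:
# a short working context attends to a cached full-video token bank,
# so only question-relevant semantics are pulled into the fused output.
import torch
import torch.nn as nn


class MemoryAugmentation(nn.Module):
    """Fuse a working context with a cached full-video memory bank."""

    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        # Cross-attention: working tokens query the cached video memory.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, working_tokens: torch.Tensor,
                memory_bank: torch.Tensor) -> torch.Tensor:
        # working_tokens: (B, N, D) question-conditioned tokens
        # memory_bank:    (B, M, D) cached features for the full video
        attended, _ = self.cross_attn(
            query=working_tokens, key=memory_bank, value=memory_bank
        )
        # Residual fusion keeps the working context and adds retrieved semantics.
        return self.norm(working_tokens + attended)


if __name__ == "__main__":
    # Assume one cached token per frame: 3600 frames ~= 1 hour at 1 FPS.
    B, N, M, D = 1, 64, 3600, 1024
    module = MemoryAugmentation(dim=D)
    fused = module(torch.randn(B, N, D), torch.randn(B, M, D))
    print(fused.shape)  # torch.Size([1, 64, 1024])
```

Under this reading, the quadratic cost of attention applies only between the small working context and the memory bank, which is what would make 1-FPS processing of hour-long videos tractable.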