VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges

September 2, 2024
Authors: Yuxuan Wang, Cihang Xie, Yang Liu, Zilong Zheng
cs.AI

Abstract

Recent advancements in large-scale video-language models have shown significant potential for real-time planning and detailed interactions. However, their high computational demands and the scarcity of annotated datasets limit their practicality for academic researchers. In this work, we introduce VideoLLaMB, a novel framework that utilizes temporal memory tokens within bridge layers to allow the encoding of entire video sequences alongside historical visual data, effectively preserving semantic continuity and enhancing model performance across various tasks. This approach combines recurrent memory tokens with a SceneTilling algorithm, which segments videos into independent semantic units to preserve semantic integrity. Empirically, VideoLLaMB significantly outstrips existing video-language models, demonstrating a 5.5-point improvement over its competitors across three VideoQA benchmarks and a 2.06-point improvement on egocentric planning. Comprehensive results on MVBench show that VideoLLaMB-7B achieves markedly better results than previous 7B models built on the same LLM. Remarkably, it maintains performance as robust as PLLaVA even as video length increases up to 8 times. In addition, frame retrieval results on our specialized Needle in a Video Haystack (NIAVH) benchmark further validate VideoLLaMB's ability to accurately identify specific frames within lengthy videos. Our SceneTilling algorithm also enables the generation of streaming video captions directly, without requiring additional training. In terms of efficiency, VideoLLaMB, trained on 16 frames, supports up to 320 frames on a single Nvidia A100 GPU with linear GPU memory scaling, ensuring both high performance and cost-effectiveness and setting a new foundation for long-form video-language models in both academic and practical applications.
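
To make the two mechanisms named in the abstract concrete, below is a minimal, hypothetical PyTorch sketch: a SceneTilling-style segmentation step that cuts a video where adjacent-frame similarity drops, and a bridge layer whose memory tokens are updated recurrently segment by segment. All names, dimensions, the similarity threshold, and the attention-based update rule are assumptions made for illustration; they are not taken from the released VideoLLaMB implementation.

```python
# Illustrative sketch only: class/function names, dimensions, and update
# rules are assumptions for exposition, not the authors' implementation.
from typing import List, Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


def scene_tiling(frame_feats: torch.Tensor, threshold: float = 0.8) -> List[torch.Tensor]:
    """Split a video into semantic segments (SceneTilling-style, hypothetical).

    frame_feats: (T, D) per-frame visual features. The video is cut wherever
    cosine similarity between adjacent frames falls below `threshold`,
    yielding independent semantic units.
    """
    sims = F.cosine_similarity(frame_feats[:-1], frame_feats[1:], dim=-1)  # (T-1,)
    boundaries = ((sims < threshold).nonzero(as_tuple=True)[0] + 1).tolist()
    return [s for s in torch.tensor_split(frame_feats, boundaries, dim=0) if len(s) > 0]


class RecurrentMemoryBridge(nn.Module):
    """Bridge layer carrying a fixed set of memory tokens across segments,
    so earlier visual content keeps conditioning later segments."""

    def __init__(self, dim: int = 1024, num_memory_tokens: int = 32, num_heads: int = 8):
        super().__init__()
        # Learnable initial memory tokens (hypothetical initialization).
        self.init_memory = nn.Parameter(torch.randn(1, num_memory_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, segment: torch.Tensor, memory: Optional[torch.Tensor] = None) -> torch.Tensor:
        """segment: (B, T, D) features of one semantic unit; memory: (B, M, D)."""
        if memory is None:
            memory = self.init_memory.expand(segment.size(0), -1, -1)
        # Memory tokens attend over [previous memory; current segment features].
        context = torch.cat([memory, segment], dim=1)
        updated, _ = self.attn(query=memory, key=context, value=context)
        updated = self.norm(memory + updated)
        # `updated` serves both as the visual prefix handed to the LLM for this
        # segment and as the recurrent state carried into the next segment.
        return updated


if __name__ == "__main__":
    bridge = RecurrentMemoryBridge(dim=1024)
    frames = torch.randn(64, 1024)                 # dummy per-frame features
    memory = None
    for seg in scene_tiling(frames):               # iterate semantic units
        memory = bridge(seg.unsqueeze(0), memory)  # carry memory forward
    print(memory.shape)                            # (1, 32, 1024)
```

Because only the fixed-size memory tokens are carried between segments, per-step cost depends on segment length rather than total video length, which is consistent with the linear GPU-memory scaling described in the abstract.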
