VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges
September 2, 2024
Authors: Yuxuan Wang, Cihang Xie, Yang Liu, Zilong Zheng
cs.AI
Abstract
Recent advancements in large-scale video-language models have shown
significant potential for real-time planning and detailed interactions.
However, their high computational demands and the scarcity of annotated
datasets limit their practicality for academic researchers. In this work, we
introduce VideoLLaMB, a novel framework that utilizes temporal memory tokens
within bridge layers to allow for the encoding of entire video sequences
alongside historical visual data, effectively preserving semantic continuity
and enhancing model performance across various tasks. This approach includes
recurrent memory tokens and a SceneTilling algorithm, which segments videos
into independent semantic units to preserve semantic integrity. Empirically,
VideoLLaMB significantly outperforms existing video-language models,
achieving a 5.5-point improvement over its competitors across three VideoQA
benchmarks and a 2.06-point improvement on egocentric planning. Comprehensive
results on MVBench show that VideoLLaMB-7B achieves markedly better results
than previous 7B models built on the same LLM. Remarkably, it maintains
performance as robust as PLLaVA's even as video length increases up to 8
times. Moreover, the frame retrieval results on our specialized Needle in a
Video Haystack (NIAVH) benchmark further validate VideoLLaMB's prowess in
accurately identifying specific frames within lengthy videos. Our SceneTilling
algorithm
also enables the generation of streaming video captions directly, without
necessitating additional training. In terms of efficiency, VideoLLaMB, trained
on 16 frames, supports up to 320 frames on a single Nvidia A100 GPU with linear
GPU memory scaling, ensuring both high performance and cost-effectiveness,
thereby setting a new foundation for long-form video-language models in both
academic and practical applications.
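
To make the recurrent memory bridge concrete, below is a minimal PyTorch sketch of how memory tokens might be carried across video segments and refreshed by attending to each segment's visual features, so that later segments stay conditioned on history. This is an illustrative assumption rather than the authors' implementation: the class name RecurrentMemoryBridge, the dimensions, and the attention-based update rule are all placeholders.

# Minimal sketch (not the authors' code) of the recurrent-memory-bridge idea:
# learnable memory tokens are carried across video segments and updated by
# attending to each segment's frame features, so later segments can condition
# on everything seen so far. Names, sizes, and the update rule are assumptions.
import torch
import torch.nn as nn

class RecurrentMemoryBridge(nn.Module):
    def __init__(self, dim: int = 768, num_memory_tokens: int = 8, num_heads: int = 8):
        super().__init__()
        # Learnable initial memory tokens (hypothetical initialization).
        self.memory = nn.Parameter(torch.randn(1, num_memory_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, segments: list[torch.Tensor]) -> list[torch.Tensor]:
        """segments: list of (batch, tokens, dim) visual features, one per semantic unit."""
        memory = self.memory.expand(segments[0].shape[0], -1, -1)
        bridged = []
        for seg in segments:
            # Memory tokens query the current segment together with the previous
            # memory, yielding an updated running summary of the video so far.
            context = torch.cat([memory, seg], dim=1)
            updated, _ = self.attn(memory, context, context)
            memory = self.norm(memory + updated)
            # Hand the segment features plus the refreshed memory to the LLM side.
            bridged.append(torch.cat([memory, seg], dim=1))
        return bridged

Because only a fixed number of memory tokens is carried forward from one segment to the next, per-segment cost stays constant, which is consistent with the linear GPU memory scaling reported above.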
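
One plausible, TextTiling-style reading of SceneTilling is to cut the frame sequence wherever the similarity between adjacent frame features drops sharply, yielding independent semantic units. The sketch below illustrates that reading only; the function name scene_tilling, the threshold, and the boundary rule are assumptions, not the paper's algorithm.

# Hypothetical, TextTiling-style segmentation of a video into semantic units:
# cut after frame pairs whose cosine similarity falls below a threshold.
import torch
import torch.nn.functional as F

def scene_tilling(frame_features: torch.Tensor, threshold: float = 0.7) -> list[tuple[int, int]]:
    """frame_features: (num_frames, dim) per-frame embeddings from a visual encoder."""
    sims = F.cosine_similarity(frame_features[:-1], frame_features[1:], dim=-1)
    cuts = (sims < threshold).nonzero(as_tuple=True)[0] + 1  # boundary after a low-similarity pair
    segments, start = [], 0
    for cut in cuts.tolist():
        segments.append((start, cut))
        start = cut
    segments.append((start, frame_features.shape[0]))
    return segments  # list of (start_frame, end_frame) half-open intervals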