Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams
June 12, 2024
Authors: Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, Xiaojie Jin
cs.AI
Abstract
Benefiting from the advancements in large language models and cross-modal
alignment, existing multi-modal video understanding methods have achieved
prominent performance in offline scenarios. However, online video streams, as
one of the most common media forms in the real world, have seldom received
attention. Compared to offline videos, the 'dynamic' nature of online video
streams poses challenges for the direct application of existing models and
introduces new problems, such as the storage of extremely long-term
information and the interaction between continuous visual content and
'asynchronous' user questions. Therefore, in this paper we present
Flash-VStream, a video-language model that simulates the human memory
mechanism. Our model is able to process extremely long video streams in
real time and respond to user queries simultaneously. Compared to existing
models, Flash-VStream achieves significant reductions in inference latency
and VRAM consumption, both of which are critical for understanding online
streaming video. In addition, given that existing video understanding
benchmarks predominantly concentrate on offline scenarios, we propose
VStream-QA, a novel question answering benchmark specifically designed for
online video stream understanding. Comparisons with popular existing methods
on the proposed benchmark demonstrate the superiority of our method in this
challenging setting. To verify the generalizability of our approach, we
further evaluate it on existing video understanding benchmarks and achieve
state-of-the-art performance in offline scenarios as well. All code, models,
and datasets are available at https://invinciblewyq.github.io/vstream-page/.
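To make the 'asynchronous' interaction concrete, below is a minimal, hypothetical Python sketch of the streaming setup the abstract describes: one thread continuously ingests frames and consolidates them into a bounded memory, while user questions arrive at arbitrary times and are answered against the current memory snapshot. The names (`embed_frame`, `answer_from_memory`, `MEMORY_SIZE`) and the averaging-based consolidation rule are illustrative assumptions, not the paper's actual components.

```python
import threading
import time
from collections import deque

import numpy as np

MEMORY_SIZE = 16   # bounded number of memory slots (assumption)
FEATURE_DIM = 8    # toy feature dimension (assumption)

memory = deque(maxlen=MEMORY_SIZE)  # fixed-size memory for long-term info
memory_lock = threading.Lock()

def embed_frame(frame_id: int) -> np.ndarray:
    """Placeholder for a visual encoder; returns a toy feature vector."""
    rng = np.random.default_rng(frame_id)
    return rng.standard_normal(FEATURE_DIM)

def ingest_stream(num_frames: int) -> None:
    """Consume frames in real time, keeping memory at a fixed size."""
    for frame_id in range(num_frames):
        feat = embed_frame(frame_id)
        with memory_lock:
            if len(memory) == MEMORY_SIZE:
                # Toy consolidation: merge the two oldest entries so
                # long-term information is compressed rather than dropped.
                a = memory.popleft()
                b = memory.popleft()
                memory.appendleft((a + b) / 2.0)
            memory.append(feat)
        time.sleep(0.01)  # simulate frame arrival rate

def answer_from_memory(question: str) -> str:
    """Placeholder QA: answers depend only on the memory snapshot,
    so latency and VRAM stay flat no matter how long the stream is."""
    with memory_lock:
        snapshot = list(memory)
    return f"[{question!r}] answered from {len(snapshot)} memory slots"

if __name__ == "__main__":
    producer = threading.Thread(target=ingest_stream, args=(200,))
    producer.start()
    # Questions arrive asynchronously while the stream is still running.
    for q in ("What happened so far?", "Describe the latest scene."):
        time.sleep(0.3)
        print(answer_from_memory(q))
    producer.join()
```

The key design point this sketch illustrates is that the memory stays bounded regardless of stream length, which is why per-query cost does not grow with the number of frames seen.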