Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams

June 12, 2024
Authors: Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, Xiaojie Jin
cs.AI

Abstract

Benefiting from advances in large language models and cross-modal alignment, existing multi-modal video understanding methods have achieved prominent performance in offline scenarios. However, online video streams, one of the most common media forms in the real world, have seldom received attention. Compared to offline videos, the 'dynamic' nature of online video streams poses challenges for the direct application of existing models and introduces new problems, such as the storage of extremely long-term information and the interaction between continuous visual content and 'asynchronous' user questions. Therefore, in this paper we present Flash-VStream, a video-language model that simulates the human memory mechanism. Our model is able to process extremely long video streams in real time and respond to user queries simultaneously. Compared to existing models, Flash-VStream achieves significant reductions in inference latency and VRAM consumption, which are critical for understanding online streaming video. In addition, given that existing video understanding benchmarks predominantly concentrate on offline scenarios, we propose VStream-QA, a novel question-answering benchmark specifically designed for online video stream understanding. Comparisons with popular existing methods on the proposed benchmark demonstrate the superiority of our method in this challenging setting. To verify the generalizability of our approach, we further evaluate it on existing video understanding benchmarks, where it achieves state-of-the-art performance in offline scenarios as well. All code, models, and datasets are available at https://invinciblewyq.github.io/vstream-page/
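
The abstract does not specify how the memory is maintained, but the overall idea of a bounded memory that ingests a stream and can be queried at any time can be illustrated with a minimal sketch. Everything below (the StreamingMemory class, its capacity parameter, and the merge-by-similarity consolidation rule) is a hypothetical stand-in for illustration, not the authors' implementation or API.

```python
# Hypothetical sketch of a memory-based streaming pipeline (not the paper's code).
# Encoded frame features are appended to a fixed-capacity memory; when the memory
# is full, the most similar adjacent pair of entries is merged, so memory size
# (and hence VRAM for the visual context) stays bounded regardless of stream length.
# A user query can be answered at any time against the current memory snapshot.

import numpy as np


class StreamingMemory:
    def __init__(self, capacity: int = 64, feat_dim: int = 256):
        self.capacity = capacity
        self.feat_dim = feat_dim
        self.slots: list[np.ndarray] = []   # consolidated frame features
        self.weights: list[int] = []        # number of frames each slot summarizes

    def ingest(self, frame_feature: np.ndarray) -> None:
        """Add one encoded frame; consolidate if the memory exceeds capacity."""
        self.slots.append(frame_feature)
        self.weights.append(1)
        if len(self.slots) > self.capacity:
            self._merge_most_similar_pair()

    def _merge_most_similar_pair(self) -> None:
        # Merge the adjacent pair with the highest cosine similarity by
        # weighted averaging (a simple stand-in for memory consolidation).
        sims = [
            float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
            for a, b in zip(self.slots[:-1], self.slots[1:])
        ]
        i = int(np.argmax(sims))
        wa, wb = self.weights[i], self.weights[i + 1]
        merged = (self.slots[i] * wa + self.slots[i + 1] * wb) / (wa + wb)
        self.slots[i : i + 2] = [merged]
        self.weights[i : i + 2] = [wa + wb]

    def snapshot(self) -> np.ndarray:
        """Fixed-size visual context to hand to a language model."""
        return np.stack(self.slots)


# Usage: ingest an arbitrarily long stream frame by frame; query at any point.
memory = StreamingMemory(capacity=8, feat_dim=16)
rng = np.random.default_rng(0)
for t in range(1000):                        # long stream
    memory.ingest(rng.standard_normal(16))   # stand-in for a vision encoder output
context = memory.snapshot()                  # shape (8, 16): bounded, stream-length independent
print(context.shape)
```

Because the memory size is fixed, per-frame ingestion cost and the visual context passed to the language model do not grow with the stream, which is the kind of property that keeps inference latency and VRAM bounded in the online setting described above.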
