Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams

June 12, 2024
Authors: Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, Xiaojie Jin
cs.AI

Abstract

Benefiting from advances in large language models and cross-modal alignment, existing multi-modal video understanding methods have achieved prominent performance in offline scenarios. However, online video streams, one of the most common media forms in the real world, have seldom received attention. Compared to offline videos, the 'dynamic' nature of online video streams poses challenges for the direct application of existing models and introduces new problems, such as the storage of extremely long-term information and the interaction between continuous visual content and 'asynchronous' user questions. Therefore, in this paper we present Flash-VStream, a video-language model that simulates the human memory mechanism. Our model is able to process extremely long video streams in real time and respond to user queries simultaneously. Compared to existing models, Flash-VStream achieves significant reductions in inference latency and VRAM consumption, both of which are closely tied to understanding online streaming video. In addition, given that existing video understanding benchmarks predominantly concentrate on offline scenarios, we propose VStream-QA, a novel question answering benchmark specifically designed for online video stream understanding. Comparisons with popular existing methods on the proposed benchmark demonstrate the superiority of our method in this challenging setting. To verify the generalizability of our approach, we further evaluate it on existing video understanding benchmarks and achieve state-of-the-art performance in offline scenarios as well. All code, models, and datasets are available at https://invinciblewyq.github.io/vstream-page/
