
BIMBA:面向长距离视频问答的选择性扫描压缩技术

BIMBA: Selective-Scan Compression for Long-Range Video Question Answering

March 12, 2025
Authors: Md Mohaiminul Islam, Tushar Nagarajan, Huiyu Wang, Gedas Bertasius, Lorenzo Torresani
cs.AI

Abstract

Video Question Answering (VQA) in long videos poses the key challenge of extracting relevant information and modeling long-range dependencies from many redundant frames. The self-attention mechanism provides a general solution for sequence modeling, but it has a prohibitive cost when applied to a massive number of spatiotemporal tokens in long videos. Most prior methods rely on compression strategies to lower the computational cost, such as reducing the input length via sparse frame sampling or compressing the output sequence passed to the large language model (LLM) via space-time pooling. However, these naive approaches over-represent redundant information and often miss salient events or fast-occurring space-time patterns. In this work, we introduce BIMBA, an efficient state-space model to handle long-form videos. Our model leverages the selective scan algorithm to learn to effectively select critical information from high-dimensional video and transform it into a reduced token sequence for efficient LLM processing. Extensive experiments demonstrate that BIMBA achieves state-of-the-art accuracy on multiple long-form VQA benchmarks, including PerceptionTest, NExT-QA, EgoSchema, VNBench, LongVideoBench, and Video-MME. Code and models are publicly available at https://sites.google.com/view/bimba-mllm.
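The abstract describes compressing the long spatiotemporal token sequence with a selective-scan (Mamba-style) state-space layer before handing it to the LLM. The sketch below is a hypothetical illustration of that general idea, not the authors' implementation: a toy selective scan written as an explicit loop, plus learnable query tokens appended to the video tokens so the scan outputs at the query positions form the reduced sequence. All class names, dimensions, and the query-appending strategy are assumptions made for illustration.

```python
# Hypothetical sketch of selective-scan token compression (not the BIMBA code).
import torch
import torch.nn as nn


class SelectiveScan(nn.Module):
    """Minimal input-dependent (selective) state-space scan, written as an
    explicit time loop for clarity rather than speed."""

    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.A_log = nn.Parameter(torch.zeros(d_model, d_state))  # log-parameterized decay
        self.proj_delta = nn.Linear(d_model, d_model)              # input-dependent step size
        self.proj_B = nn.Linear(d_model, d_state)                  # input-dependent input matrix
        self.proj_C = nn.Linear(d_model, d_state)                  # input-dependent readout

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, d_model) -> (batch, length, d_model)
        B_, L, D = x.shape
        A = -torch.exp(self.A_log)                                  # (D, S), negative for stability
        delta = torch.nn.functional.softplus(self.proj_delta(x))    # (B, L, D)
        Bmat = self.proj_B(x)                                       # (B, L, S)
        Cmat = self.proj_C(x)                                       # (B, L, S)
        h = x.new_zeros(B_, D, A.shape[-1])                         # hidden state (B, D, S)
        ys = []
        for t in range(L):                                          # sequential selective scan
            dA = torch.exp(delta[:, t].unsqueeze(-1) * A)                       # (B, D, S)
            dB = delta[:, t].unsqueeze(-1) * Bmat[:, t].unsqueeze(1)            # (B, D, S)
            h = dA * h + dB * x[:, t].unsqueeze(-1)
            ys.append((h * Cmat[:, t].unsqueeze(1)).sum(-1))                    # (B, D)
        return torch.stack(ys, dim=1)


class TokenCompressor(nn.Module):
    """Compress N video tokens into R query tokens via a selective scan
    (hypothetical module; names and design are assumptions)."""

    def __init__(self, d_model: int, num_queries: int):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d_model) * 0.02)
        self.scan = SelectiveScan(d_model)

    def forward(self, video_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (batch, N, d_model) -> (batch, R, d_model)
        q = self.queries.expand(video_tokens.size(0), -1, -1)
        seq = torch.cat([video_tokens, q], dim=1)   # scan over video tokens, then queries
        out = self.scan(seq)
        return out[:, -q.size(1):]                  # keep only the query positions


# Usage example: 8 frames x 256 patches = 2048 tokens reduced to 64 tokens for the LLM.
tokens = torch.randn(1, 2048, 512)
compressed = TokenCompressor(d_model=512, num_queries=64)(tokens)
print(compressed.shape)  # torch.Size([1, 64, 512])
```

Because the query tokens appear after the video tokens, the recurrent state they read has already aggregated the whole sequence, so the scan can learn to retain salient events while discarding redundant frames; the paper's actual module may differ in how queries are placed and in the scan implementation.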

