BIMBA: Selective-Scan Compression for Long-Range Video Question Answering

Abstract

Video Question Answering (VQA) in long videos poses two key challenges: extracting relevant information from many redundant frames and modeling long-range dependencies. The self-attention mechanism provides a general solution for sequence modeling, but its cost is prohibitive when applied to the massive number of spatiotemporal tokens in long videos. Most prior methods rely on compression strategies to lower the computational cost, such as reducing the input length via sparse frame sampling or compressing the output sequence passed to the large language model (LLM) via space-time pooling. However, these naive approaches over-represent redundant information and often miss salient events or fast-occurring space-time patterns. In this work, we introduce BIMBA, an efficient state-space model for long-form videos. Our model leverages the selective scan algorithm to learn to select critical information from high-dimensional video and transform it into a reduced token sequence for efficient LLM processing. Extensive experiments demonstrate that BIMBA achieves state-of-the-art accuracy on multiple long-form VQA benchmarks, including PerceptionTest, NExT-QA, EgoSchema, VNBench, LongVideoBench, and Video-MME. Code and models are publicly available at https://sites.google.com/view/bimba-mllm.
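To make the idea of input-dependent compression concrete, the following is a minimal toy sketch in plain Python. It is not BIMBA's architecture or code: the function name `selective_scan_compress`, the gating form `decay = exp(-softplus(w . x))`, the fixed `gate_weights`, and the choice to read out the state at evenly spaced positions are all assumptions made for illustration. It only shows how a recurrence whose update strength depends on each token can turn a long token sequence into a shorter one, in contrast to uniform pooling.

```python
import math
import random


def softplus(z):
    # Numerically stable softplus: log(1 + exp(z)), always positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def selective_scan_compress(tokens, gate_weights, num_out):
    """Compress len(tokens) vectors into num_out vectors with a gated recurrence.

    Hypothetical toy update: h_t = a_t * h_{t-1} + (1 - a_t) * x_t, where the
    decay a_t = exp(-softplus(w . x_t)) depends on the input token itself.
    """
    n, dim = len(tokens), len(tokens[0])
    if not 1 <= num_out <= n:
        raise ValueError("num_out must be between 1 and len(tokens)")
    # Assumption: emit the state at the end of each of num_out equal-sized chunks.
    emit_at = {((k + 1) * n) // num_out - 1 for k in range(num_out)}
    state = [0.0] * dim
    out = []
    for t, x in enumerate(tokens):
        step = softplus(sum(w * xi for w, xi in zip(gate_weights, x)))
        # decay lies in (0, 1): a small step keeps memory, a large step overwrites it.
        decay = math.exp(-step)
        state = [decay * h + (1.0 - decay) * xi for h, xi in zip(state, x)]
        if t in emit_at:
            out.append(list(state))
    return out


rng = random.Random(0)
tokens = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(64)]
gate_weights = [0.5, -0.25, 0.1, 0.3]  # placeholder values; learned in a real model
compressed = selective_scan_compress(tokens, gate_weights, num_out=8)
print(len(tokens), "->", len(compressed), "tokens of dim", len(compressed[0]))
# prints: 64 -> 8 tokens of dim 4
```

The scan visits each token once, so its cost grows linearly with sequence length, whereas self-attention over the same tokens grows quadratically. Only the shortened sequence would then be passed on to the LLM.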
