Token-Efficient Long Video Understanding for Multimodal LLMs

Abstract

Recent advances in video-based multimodal large language models (Video-LLMs) have significantly improved video understanding by processing videos as sequences of image frames. However, many existing methods treat frames independently in the vision backbone and lack explicit temporal modeling, which limits their ability to capture dynamic patterns and to handle long videos efficiently. To address these limitations, we introduce STORM (Spatiotemporal TOken Reduction for Multimodal LLMs), a novel architecture that incorporates a dedicated temporal encoder between the image encoder and the LLM. Our temporal encoder leverages the Mamba State Space Model to integrate temporal information into image tokens, generating enriched representations that preserve inter-frame dynamics across the entire video sequence. This enriched encoding not only enhances video reasoning capabilities but also enables effective token reduction strategies, including test-time sampling and training-based temporal and spatial pooling, substantially reducing computational demands on the LLM without sacrificing key temporal information. By integrating these techniques, our approach reduces training and inference latency while improving performance, enabling efficient and robust video understanding over extended temporal contexts. Extensive evaluations show that STORM achieves state-of-the-art results across various long video understanding benchmarks (more than 5% improvement on MLVU and LongVideoBench) while reducing computation costs by up to 8× and decoding latency by 2.4-2.9× for a fixed number of input frames. The project page is available at https://research.nvidia.com/labs/lpr/storm
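To make the token reduction idea concrete, the following is a minimal sketch of temporal average pooling and test-time frame subsampling applied to per-frame visual tokens. It is an illustration under stated assumptions, not the STORM implementation: the function names, the nested-list token layout, and the window and stride values are all hypothetical, and the Mamba temporal encoder that the abstract places before token reduction is not modeled here.

```python
from typing import List

Token = List[float]   # one visual token embedding
Frame = List[Token]   # all tokens of one frame
Video = List[Frame]   # all frames of one video


def temporal_avg_pool(video: Video, window: int) -> Video:
    """Average each token position over non-overlapping groups of `window` frames.

    Hypothetical helper: assumes every frame has the same number of tokens
    and every token has the same embedding size.
    """
    pooled: Video = []
    for start in range(0, len(video), window):
        group = video[start:start + window]  # last group may be shorter
        frame: Frame = []
        for pos in range(len(group[0])):
            dim = len(group[0][pos])
            frame.append([
                sum(f[pos][d] for f in group) / len(group)
                for d in range(dim)
            ])
        pooled.append(frame)
    return pooled


def temporal_subsample(video: Video, stride: int) -> Video:
    """Test-time sampling sketch: keep the tokens of every `stride`-th frame."""
    return video[::stride]


def token_count(video: Video) -> int:
    """Total number of tokens that would be passed on to the LLM."""
    return sum(len(frame) for frame in video)
```

In this sketch, pooling 8 frames of 4 tokens each with `window=4` leaves 2 frames and 8 tokens, so the sequence passed to the LLM shrinks by the pooling window. The intuition suggested by the abstract is that such a reduction loses less when the tokens already carry temporal context from the preceding encoder.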
