VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges

Abstract

Recent advancements in large-scale video-language models have shown significant potential for real-time planning and detailed interactions. However, their high computational demands and the scarcity of annotated datasets limit their practicality for academic researchers. In this work, we introduce VideoLLaMB, a novel framework that uses temporal memory tokens within bridge layers to encode entire video sequences alongside historical visual data, preserving semantic continuity and improving model performance across various tasks. The approach combines recurrent memory tokens with a SceneTilling algorithm, which segments videos into independent semantic units to preserve semantic integrity. Empirically, VideoLLaMB significantly outperforms existing video-language models, showing a 5.5-point improvement over its competitors across three VideoQA benchmarks and a 2.06-point improvement on egocentric planning. Comprehensive results on MVBench show that VideoLLaMB-7B achieves markedly better results than previous 7B models built on the same LLM. Remarkably, like PLLaVA, it maintains robust performance even as video length increases up to 8 times. In addition, the frame retrieval results on our specialized Needle in a Video Haystack (NIAVH) benchmark further validate VideoLLaMB's ability to accurately identify specific frames within lengthy videos. Our SceneTilling algorithm also enables the generation of streaming video captions directly, without additional training. In terms of efficiency, VideoLLaMB, trained on 16 frames, supports up to 320 frames on a single Nvidia A100 GPU with linear GPU memory scaling, ensuring both high performance and cost-effectiveness and setting a new foundation for long-form video-language models in both academic and practical applications.
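The abstract does not spell out how SceneTilling locates segment boundaries. As a rough illustration only, the sketch below assumes a TextTiling-style procedure: compute the cosine similarity between adjacent frame features, score how deep each similarity valley is, and cut wherever the depth exceeds the mean plus `alpha` standard deviations. The function names, the `alpha` parameter, the thresholding rule, and the toy two-dimensional features are all hypothetical and are not taken from the paper.

```python
import math
from statistics import mean, pstdev


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def depth_scores(sims):
    """TextTiling-style depth: how far each similarity dips below the
    nearest peak reached by climbing to its left and to its right."""
    depths = []
    for i, s in enumerate(sims):
        left = i
        while left > 0 and sims[left - 1] >= sims[left]:
            left -= 1
        right = i
        while right < len(sims) - 1 and sims[right + 1] >= sims[right]:
            right += 1
        depths.append((sims[left] - s) + (sims[right] - s))
    return depths


def segment_frames(features, alpha=1.0):
    """Split frame indices into contiguous segments at deep similarity valleys.

    Assumption: a boundary is placed between frames i and i+1 when the depth
    of their similarity exceeds mean + alpha * std of all depths.
    """
    n = len(features)
    if n < 2:
        return [list(range(n))]
    sims = [cosine(features[i], features[i + 1]) for i in range(n - 1)]
    depths = depth_scores(sims)
    threshold = mean(depths) + alpha * pstdev(depths)
    segments, current = [], [0]
    for i, d in enumerate(depths):
        if d > threshold:
            segments.append(current)  # close the segment before frame i+1
            current = []
        current.append(i + 1)
    segments.append(current)
    return segments


# Toy input: three frames of one "scene" followed by three of another.
frames = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3
print(segment_frames(frames))
```

In a real pipeline the features would come from a visual encoder rather than hand-written vectors, and each resulting segment would be encoded in turn while the memory tokens carry context forward from earlier segments.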
