VideoLLM-online: Online Video Large Language Model for Streaming Video

Abstract

Recent Large Language Models have been enhanced with vision capabilities, enabling them to comprehend images, videos, and interleaved vision-language content. However, the learning methods of these large multimodal models typically treat videos as predetermined clips, which makes them less effective and less efficient at handling streaming video inputs. In this paper, we propose a novel Learning-In-Video-Stream (LIVE) framework, which enables temporally aligned, long-context, real-time conversation within a continuous video stream. The LIVE framework comprises three components for achieving video streaming dialogue: (1) a training objective designed to perform language modeling for continuous streaming inputs, (2) a data generation scheme that converts offline temporal annotations into a streaming dialogue format, and (3) an optimized inference pipeline that speeds up model responses in real-world video streams. With the LIVE framework, we build the VideoLLM-online model on Llama-2/Llama-3 and demonstrate its significant advantages in processing streaming videos. For instance, on average, our model supports streaming dialogue over a 5-minute video clip at more than 10 FPS on an A100 GPU. It also achieves state-of-the-art performance on public offline video benchmarks covering recognition, captioning, and forecasting. The code, model, data, and demo are available at https://showlab.github.io/videollm-online.
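To make the idea of converting offline temporal annotations into a streaming dialogue format more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the `Annotation` type, the `to_streaming_dialogue` function, the use of `None` to mean "stay silent", and the rule that a response is emitted at the first sampled frame at or after an annotated span ends are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Annotation:
    """Hypothetical offline annotation: a narration attached to a time span."""
    start: float  # seconds
    end: float    # seconds
    text: str


def to_streaming_dialogue(
    annotations: List[Annotation], duration: float, fps: float
) -> List[Optional[str]]:
    """Return one target per sampled frame: a response string, or None for silence.

    Assumption (not from the paper): a response is emitted at the first
    sampled frame at or after the annotated span ends; every other frame
    is a silent frame.
    """
    num_frames = int(duration * fps)
    targets: List[Optional[str]] = [None] * num_frames
    if num_frames == 0:
        return targets
    for ann in sorted(annotations, key=lambda a: a.end):
        # Frame i is sampled at time i / fps, so the first frame at or
        # after ann.end has index ceil(ann.end * fps); clamp to the last frame.
        frame = min(math.ceil(ann.end * fps), num_frames - 1)
        if targets[frame] is None:
            targets[frame] = ann.text
        else:
            # Two annotations landing on the same frame are joined.
            targets[frame] = targets[frame] + " " + ann.text
    return targets


if __name__ == "__main__":
    demo = [
        Annotation(0.0, 1.5, "pick up the cup"),
        Annotation(2.0, 3.0, "pour water"),
    ]
    for i, target in enumerate(to_streaming_dialogue(demo, duration=4.0, fps=2.0)):
        print(i, target)
```

Under this scheme, a 5-minute clip sampled at 10 FPS would yield 5 × 60 × 10 = 3,000 per-frame targets, most of them silent, which suggests why per-frame response cost matters for a streaming setting.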
