VideoLLM-online: Online Video Large Language Model for Streaming Video
June 17, 2024
Authors: Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, Mike Zheng Shou
cs.AI
Abstract
Recent Large Language Models have been enhanced with vision capabilities,
enabling them to comprehend images, videos, and interleaved vision-language
content. However, the learning methods of these large multimodal models
typically treat videos as predetermined clips, making them less effective and
efficient at handling streaming video inputs. In this paper, we propose a novel
Learning-In-Video-Stream (LIVE) framework, which enables temporally aligned,
long-context, and real-time conversation within a continuous video stream. Our
LIVE framework comprises comprehensive approaches to achieve video streaming
dialogue, encompassing: (1) a training objective designed to perform language
modeling for continuous streaming inputs, (2) a data generation scheme that
converts offline temporal annotations into a streaming dialogue format, and (3)
an optimized inference pipeline to speed up the model responses in real-world
video streams. With our LIVE framework, we build the VideoLLM-online model upon
Llama-2/Llama-3 and demonstrate its significant advantages in processing
streaming videos. For instance, our model can, on average, support streaming
dialogue on a 5-minute video clip at over 10 FPS on an A100 GPU. Moreover, it
also showcases state-of-the-art performance on public offline video benchmarks,
such as recognition, captioning, and forecasting. The code, model, data, and
demo have been made available at https://showlab.github.io/videollm-online.
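The abstract describes a model that consumes a continuous frame stream and decides, frame by frame, whether to stay silent or respond in real time. The toy loop below illustrates that per-frame decision structure only; every name in it (`StreamingChat`, `step`) is a hypothetical stand-in, not the paper's actual API or training objective.

```python
# Minimal sketch of a streaming dialogue loop: each incoming frame is appended
# to a growing context, and the model decides per frame whether to respond or
# stay silent (analogous to predicting an EOS-like token for that frame).
# All names are illustrative assumptions, not the LIVE framework's real code.

from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StreamingChat:
    """Toy stand-in for a streaming video LLM with a long context."""
    context: List[str] = field(default_factory=list)  # growing stream of tokens

    def step(self, frame_feature: str, query: Optional[str] = None) -> Optional[str]:
        # Append the new frame to the long-context stream.
        self.context.append(frame_feature)
        if query is not None:
            self.context.append(f"user: {query}")
            # A real model scores "respond now" vs. "keep silent" per frame;
            # this toy version responds only when a user query is pending.
            return f"response after {len(self.context)} stream tokens"
        return None  # silent for this frame


chat = StreamingChat()
outputs = [
    chat.step(f"frame_{t}", query="What is happening?" if t == 3 else None)
    for t in range(5)
]
```

In this sketch the model answers once, at the frame where the query arrives, and remains silent for every other frame; the actual framework additionally trains when to speak from temporally aligned annotations.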