VideoLLM-online: Online Video Large Language Model for Streaming Video
June 17, 2024
Authors: Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, Mike Zheng Shou
cs.AI
Abstract
Recent Large Language Models have been enhanced with vision capabilities,
enabling them to comprehend images, videos, and interleaved vision-language
content. However, the learning methods of these large multimodal models
typically treat videos as predetermined clips, making them less effective and
efficient at handling streaming video inputs. In this paper, we propose a novel
Learning-In-Video-Stream (LIVE) framework, which enables temporally aligned,
long-context, and real-time conversation within a continuous video stream. Our
LIVE framework comprises comprehensive approaches to achieve video streaming
dialogue, encompassing: (1) a training objective designed to perform language
modeling for continuous streaming inputs, (2) a data generation scheme that
converts offline temporal annotations into a streaming dialogue format, and (3)
an optimized inference pipeline to speed up the model responses in real-world
video streams. With our LIVE framework, we build the VideoLLM-online model upon
Llama-2/Llama-3 and demonstrate its significant advantages in processing
streaming videos. For instance, on average, our model can support streaming
dialogue in a 5-minute video clip at over 10 FPS on an A100 GPU. Moreover, it
also showcases state-of-the-art performance on public offline video benchmarks,
such as recognition, captioning, and forecasting. The code, model, data, and
demo have been made available at https://showlab.github.io/videollm-online.
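To illustrate component (2) of the framework, the sketch below shows one plausible way to convert offline temporal annotations into a streaming dialogue format: the video is sampled at a fixed frame rate, most frames carry no response (the model should stay silent), and an annotated caption becomes the expected response at the frame where its event ends. The function name, annotation tuple layout, and per-frame record schema are illustrative assumptions, not the paper's actual data format.

```python
def annotations_to_stream(annotations, fps=2, duration=None):
    """Turn offline (start_sec, end_sec, caption) annotations into a
    per-frame event stream for streaming dialogue training.

    Each entry has a timestamp, a frame index, and a response that is
    None on silent frames (hypothetical schema for illustration).
    """
    if duration is None:
        duration = max(end for _, end, _ in annotations)
    n_frames = int(duration * fps)
    stream = []
    for i in range(n_frames):
        t = i / fps
        # Emit the caption at the first sampled frame at/after the
        # annotation's end time; all other frames stay silent.
        reply = next(
            (cap for _, end, cap in annotations
             if end <= t < end + 1 / fps),
            None,
        )
        stream.append({"time": t, "frame": i, "response": reply})
    return stream

# Example: two annotated events in a 5-second clip sampled at 2 FPS.
events = annotations_to_stream(
    [(0.0, 2.0, "A person picks up a cup."),
     (2.5, 4.0, "They pour water into it.")],
    fps=2, duration=5.0,
)
```

Under this scheme the training objective (component (1)) can then supervise an end-of-silence token on every frame whose `response` is `None`, and ordinary language-model targets on the frames that carry a caption.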