Streaming Video Instruction Tuning
December 24, 2025
Authors: Jiaer Xia, Peixian Chen, Mengdan Zhang, Xing Sun, Kaiyang Zhou
cs.AI
Abstract
We present Streamo, a real-time streaming video LLM that serves as a general-purpose interactive assistant. Unlike existing online video models that focus narrowly on question answering or captioning, Streamo performs a broad spectrum of streaming video tasks, including real-time narration, action understanding, event captioning, temporal event grounding, and time-sensitive question answering. To develop such versatility, we construct Streamo-Instruct-465K, a large-scale instruction-following dataset tailored for streaming video understanding. The dataset covers diverse temporal contexts and multi-task supervision, enabling unified training across heterogeneous streaming tasks. After training end-to-end on the instruction-following dataset through a streamlined pipeline, Streamo exhibits strong temporal reasoning, responsive interaction, and broad generalization across a variety of streaming benchmarks. Extensive experiments show that Streamo bridges the gap between offline video perception models and real-time multimodal assistants, taking a step toward unified, intelligent video understanding in continuous video streams.