StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant
May 8, 2025
Authors: Haibo Wang, Bo Feng, Zhengfeng Lai, Mingze Xu, Shiyu Li, Weifeng Ge, Afshin Dehghan, Meng Cao, Ping Huang
cs.AI
Abstract
We present StreamBridge, a simple yet effective framework that seamlessly
transforms offline Video-LLMs into streaming-capable models. It addresses two
fundamental challenges in adapting existing models to online scenarios: (1)
limited capability for multi-turn real-time understanding, and (2) lack of
proactive response mechanisms. Specifically, StreamBridge incorporates (1) a
memory buffer combined with a round-decayed compression strategy, supporting
long-context multi-turn interactions, and (2) a decoupled, lightweight
activation model that can be effortlessly integrated into existing Video-LLMs,
enabling continuous proactive responses. To further support StreamBridge, we
construct Stream-IT, a large-scale dataset tailored for streaming video
understanding, featuring interleaved video-text sequences and diverse
instruction formats. Extensive experiments show that StreamBridge significantly
improves the streaming understanding capabilities of offline Video-LLMs across
various tasks, outperforming even proprietary models such as GPT-4o and Gemini
1.5 Pro. Simultaneously, it achieves competitive or superior performance on
standard video understanding benchmarks.
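
The abstract names a memory buffer with a round-decayed compression strategy but gives no implementation details. The following is only an illustrative sketch of one way such a buffer could behave, under assumptions that are not from the paper: each interaction round stores a list of per-frame feature vectors, compression is mean pooling over consecutive frames, older rounds are compressed before newer ones, and the most recent round is left untouched. All names (`Round`, `mean_pool`, `compress_round_decayed`, `max_frames`, `factor`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Round:
    """One interaction round: frame features plus the text exchanged (hypothetical layout)."""
    frames: List[List[float]]
    text: str


def mean_pool(frames: List[List[float]], factor: int) -> List[List[float]]:
    """Average each consecutive group of `factor` frame vectors into one vector."""
    pooled = []
    for start in range(0, len(frames), factor):
        group = frames[start:start + factor]
        dim = len(group[0])
        pooled.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return pooled


def total_frames(rounds: List[Round]) -> int:
    """Count the frame vectors currently held across all rounds."""
    return sum(len(r.frames) for r in rounds)


def compress_round_decayed(rounds: List[Round], max_frames: int, factor: int = 2) -> List[Round]:
    """Assumed policy: pool the oldest rounds first until the frame budget is met.

    The latest round is never compressed, so it may exceed the budget on its own.
    """
    if factor < 2:
        raise ValueError("factor must be at least 2")
    # Work on copies so the caller's buffer is not modified.
    rounds = [Round(list(r.frames), r.text) for r in rounds]
    for r in rounds[:-1]:  # oldest to newest, skipping the most recent round
        while total_frames(rounds) > max_frames and len(r.frames) > 1:
            r.frames = mean_pool(r.frames, factor)
        if total_frames(rounds) <= max_frames:
            break
    return rounds
```

In this sketch the earliest rounds lose the most temporal detail while recent context stays at full resolution, which is one plausible reading of "round-decayed"; the actual StreamBridge procedure may differ.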