ViSpeak: Visual Instruction Feedback in Streaming Videos
March 17, 2025
Authors: Shenghao Fu, Qize Yang, Yuan-Ming Li, Yi-Xing Peng, Kun-Yu Lin, Xihan Wei, Jian-Fang Hu, Xiaohua Xie, Wei-Shi Zheng
cs.AI
Abstract
Recent advances in Large Multi-modal Models (LMMs) have focused primarily on offline video understanding. In contrast, streaming video understanding poses great challenges to current models due to its time-sensitive, omni-modal, and interactive characteristics. In this work, we aim to extend streaming video understanding from a new perspective and propose a novel task named Visual Instruction Feedback, in which models must be aware of visual content and learn to extract instructions from it. For example, when a user waves at an agent, the agent should recognize the gesture and start a conversation with a welcome message. Following instructions in the visual modality thus greatly enhances user-agent interaction. To facilitate research, we define seven key subtasks highly relevant to the visual modality and collect the ViSpeak-Instruct dataset for training and the ViSpeak-Bench benchmark for evaluation. Further, we propose the ViSpeak model, a state-of-the-art (SOTA) streaming video understanding LMM with GPT-4o-level performance on various streaming video understanding benchmarks. After finetuning on our ViSpeak-Instruct dataset, ViSpeak is equipped with basic visual instruction feedback ability, serving as a solid baseline for future research.
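To make the task concrete, below is a minimal sketch of the proactive loop the abstract describes: an agent continuously watches a frame stream and, upon perceiving a visual instruction (a wave), initiates a response on its own rather than waiting for a text prompt. All names here (`Frame`, `detect_gesture`, `run_agent`) are illustrative assumptions, not ViSpeak's actual interface.

```python
# A minimal, hypothetical sketch of the Visual Instruction Feedback loop.
# Every name below is an illustrative assumption, not the paper's API.
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Frame:
    timestamp: float  # seconds since the stream started
    pixels: bytes     # stand-in for raw image data


def detect_gesture(frame: Frame) -> Optional[str]:
    """Hypothetical perception stub. A real system would run vision
    inference here; for illustration we pretend a wave appears one
    second into the stream."""
    return "wave" if frame.timestamp >= 1.0 else None


def run_agent(frames: Iterator[Frame]) -> None:
    """Watch the stream and respond proactively: the instruction
    arrives in the visual modality, not as a text query."""
    for frame in frames:
        if detect_gesture(frame) == "wave":
            print(f"[{frame.timestamp:.1f}s] Agent: Hello! How can I help?")
            break  # respond once per detected instruction


if __name__ == "__main__":
    # Dummy three-frame stream for illustration.
    run_agent(Frame(t, b"") for t in (0.0, 0.5, 1.0))
```

The point of the sketch is the inversion of the usual chat pattern: detection of visual content, rather than a typed user turn, is what triggers the agent's response.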