

RIVER: A Real-Time Interaction Benchmark for Video LLMs

March 4, 2026
Authors: Yansong Shi, Qingsong Zhao, Tianxiang Jiang, Xiangyu Zeng, Yi Wang, Limin Wang
cs.AI

Abstract

The rapid advancement of multimodal large language models has demonstrated impressive capabilities, yet nearly all such models operate in an offline paradigm, hindering real-time interactivity. To address this gap, we introduce the Real-tIme Video intERaction Bench (RIVER Bench), designed to evaluate online video comprehension. RIVER Bench introduces a novel framework comprising Retrospective Memory, Live-Perception, and Proactive Anticipation tasks, closely mimicking progressive interactive dialogue rather than responding to an entire video at once. We conducted detailed annotations on videos from diverse sources and of varying lengths, and precisely defined the real-time interactive format. Evaluations across model categories reveal that while offline models perform well on single-turn question-answering tasks, they struggle with real-time processing. To address the limitations of existing models in online video interaction, especially their deficiencies in long-term memory and future perception, we propose a general improvement method that enables models to interact with users more flexibly in real time. We believe this work will significantly advance the development of real-time interactive video understanding models and inspire future research in this emerging field. Datasets and code are publicly available at https://github.com/OpenGVLab/RIVER.
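The abstract contrasts offline whole-video answering with a streaming protocol in which queries about the past, present, and future are interleaved with frame arrival. The following toy harness sketches that distinction; the `StreamingEvaluator` class, its method names, and the frame strings are illustrative assumptions, not the actual RIVER benchmark API.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class StreamingEvaluator:
    """Toy harness illustrating the three RIVER-style task types.

    Frames arrive one at a time; queries are issued mid-stream rather
    than after the whole video, so answers must come from what has been
    seen so far (memory), what is visible now (perception), or what is
    expected next (anticipation).
    """
    seen: List[str] = field(default_factory=list)  # frames observed so far

    def ingest(self, frame: str) -> None:
        # A real system would update model state here; we just buffer.
        self.seen.append(frame)

    def retrospective_memory(self, t: int) -> str:
        # Recall an earlier moment; only already-streamed frames exist.
        return self.seen[t] if t < len(self.seen) else "unseen"

    def live_perception(self) -> str:
        # Describe the most recent frame.
        return self.seen[-1] if self.seen else "no frame yet"

    def proactive_anticipation(self) -> str:
        # Placeholder: a real model would forecast the next event
        # from the observed prefix instead of returning a stub.
        return f"predict after {len(self.seen)} frames"

# Simulated stream: ingest frames one by one, then query mid-dialogue.
ev = StreamingEvaluator()
for frame in ["door opens", "person enters", "sits down"]:
    ev.ingest(frame)
print(ev.retrospective_memory(0))  # earliest observed frame
print(ev.live_perception())        # current frame
```

The key design point this illustrates is that, unlike offline evaluation, no query ever has access to frames beyond the current stream position.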