RIVER: A Real-Time Interaction Benchmark for Video LLMs
March 4, 2026
Authors: Yansong Shi, Qingsong Zhao, Tianxiang Jiang, Xiangyu Zeng, Yi Wang, Limin Wang
cs.AI
Abstract
The rapid advancement of multimodal large language models has demonstrated impressive capabilities, yet nearly all such models operate in an offline paradigm, hindering real-time interactivity. Addressing this gap, we introduce the Real-tIme Video intERaction Bench (RIVER Bench), designed to evaluate online video comprehension. RIVER Bench introduces a novel framework comprising Retrospective Memory, Live Perception, and Proactive Anticipation tasks, closely mimicking interactive dialogue rather than responding to an entire video at once. We conducted detailed annotations using videos from diverse sources and of varying lengths, and precisely defined the real-time interactive data format. Evaluations across various model categories reveal that while offline models perform well on single question-answering tasks, they struggle with real-time processing. Addressing the limitations of existing models in online video interaction, especially their deficiencies in long-term memory and future perception, we propose a general improvement method that enables models to interact with users more flexibly in real time. We believe this work will significantly advance the development of real-time interactive video understanding models and inspire future research in this emerging field. Datasets and code are publicly available at https://github.com/OpenGVLab/RIVER.
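To make the three task categories concrete, the sketch below illustrates one plausible way to distinguish them: each query is asked at a specific timestamp within the video stream, and its category depends on whether the evidence needed to answer lies in the past, the present, or the future relative to the ask time. The `Query` schema and field names here are hypothetical illustrations, not RIVER's actual data format.

```python
from dataclasses import dataclass

# Hypothetical trace format (not RIVER's actual schema): frames arrive as a
# timestamped stream, and user queries are interleaved at specific timestamps
# instead of being issued after the full video has been processed.

@dataclass
class Query:
    t_ask: float       # timestamp (seconds) at which the user asks
    t_evidence: float  # timestamp of the video evidence needed to answer
    text: str

def task_type(q: Query) -> str:
    """Classify a query by where its evidence lies relative to the ask time."""
    if q.t_evidence < q.t_ask:
        return "retrospective_memory"   # recall content already streamed
    if q.t_evidence == q.t_ask:
        return "live_perception"        # describe the current moment
    return "proactive_anticipation"     # predict content not yet seen

queries = [
    Query(30.0, 5.0, "What color was the car that passed earlier?"),
    Query(30.0, 30.0, "What is the person doing right now?"),
    Query(30.0, 45.0, "What is likely to happen next?"),
]
print([task_type(q) for q in queries])
```

Under this framing, an offline model that only sees the finished video trivially handles all three categories, whereas a streaming model must maintain long-term memory for the first and anticipate unseen content for the third, matching the deficiencies the benchmark is designed to expose.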