RIVER: A Real-Time Interaction Benchmark for Video LLMs

Abstract

Multimodal large language models have advanced rapidly and show impressive capabilities, yet nearly all of them operate in an offline paradigm, which hinders real-time interactivity. To address this gap, we introduce the Real-tIme Video intERaction Bench (RIVER Bench), designed to evaluate online video comprehension. RIVER Bench introduces a novel framework comprising Retrospective Memory, Live-Perception, and Proactive Anticipation tasks, closely mimicking interactive dialogue rather than responding to an entire video at once. We produced detailed annotations on videos from diverse sources and of varying lengths, and precisely defined the real-time interactive format. Evaluations across various model categories reveal that while offline models perform well on single question-answering tasks, they struggle with real-time processing. To address the limitations of existing models in online video interaction, especially their deficiencies in long-term memory and future perception, we propose a general improvement method that enables models to interact with users more flexibly in real time. We believe this work will significantly advance the development of real-time interactive video understanding models and inspire future research in this emerging field. Datasets and code are publicly available at https://github.com/OpenGVLab/RIVER.