VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos

Abstract

We present VRBench, the first long narrative video benchmark designed to evaluate the multi-step reasoning capabilities of large models, addressing a limitation of existing evaluations, which overlook temporal reasoning and procedural validity. It comprises 1,010 long videos (average duration 1.6 hours), 9,468 human-labeled multi-step question-answering pairs, and 30,292 timestamped reasoning steps. The videos are curated through a multi-stage filtering process that includes expert inter-rater review to prioritize plot coherence. We develop a human-AI collaborative framework that generates coherent reasoning chains, each requiring multiple temporally grounded steps, spanning seven types (e.g., event attribution, implicit inference). We design a multi-phase evaluation pipeline that assesses models at both the outcome and process levels. In addition to multiple-choice questions (MCQs) for the final results, we propose a progress-level LLM-guided scoring metric that evaluates the quality of the reasoning chain along multiple dimensions. Through extensive evaluations of 12 LLMs and 16 VLMs on VRBench, we provide a thorough analysis and insights intended to advance the field of multi-step reasoning.
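The headline counts in the abstract imply a few averages that help gauge the benchmark's density. The short sketch below derives them; the constant and function names are illustrative and not part of VRBench, and the total-hours figure is only approximate because it is computed from the rounded 1.6-hour average.

```python
# Figures quoted in the abstract; all names here are illustrative.
NUM_VIDEOS = 1_010
AVG_DURATION_HOURS = 1.6  # rounded average, so totals are approximate
NUM_QA_PAIRS = 9_468
NUM_REASONING_STEPS = 30_292


def derived_stats() -> dict[str, float]:
    """Return simple averages implied by the headline counts."""
    return {
        "total_video_hours": NUM_VIDEOS * AVG_DURATION_HOURS,
        "qa_pairs_per_video": NUM_QA_PAIRS / NUM_VIDEOS,
        "steps_per_qa_pair": NUM_REASONING_STEPS / NUM_QA_PAIRS,
    }


if __name__ == "__main__":
    for name, value in derived_stats().items():
        print(f"{name}: {value:.2f}")
```

Under these figures, the benchmark covers roughly 1,616 hours of video, with about 9.4 question-answering pairs per video and about 3.2 reasoning steps per question.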

