Video-MME-Logical: A Controlled Diagnostic Benchmark for Video Temporal-Logical Reasoning

Abstract

Recent interest in multimodal large language models (MLLMs) raises a central question: can they reason over dynamic visual evidence rather than merely recognize objects or events in individual frames? This ability, which we refer to as video temporal-logical reasoning, requires models to maintain, update, and compose evidence as visual states evolve across frames. Existing video benchmarks often conflate this capability with scene complexity, static recognition, or uncontrolled temporal variation. To isolate this capability, we introduce Video-MME-Logical, a controlled benchmark organized around five temporal-logical operations: state tracking, sequential counting, temporal ordering, dynamic spatiality, and structural composition. The benchmark contains 25 fine-grained task categories generated with controlled object states, transitions, temporal dependencies, and logical compositions. It enables difficulty-controlled final-answer evaluation by varying temporal horizon and reasoning complexity, and supports intermediate-state diagnostics by verifying whether models recover the required logical reasoning trace before producing the final answer. Experiments with state-of-the-art MLLMs reveal a substantial human-model gap, especially as temporal-logical complexity increases. Supervised fine-tuning on up to 500K generated samples improves performance but remains insufficient to close the reasoning gap, positioning Video-MME-Logical as a scalable testbed for analyzing and improving temporal-logical reasoning in MLLMs.
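
To make the evaluation setup concrete, the following is a minimal sketch of how a single state-tracking sample with a difficulty knob and a ground-truth reasoning trace might be generated and scored. It is an illustration under stated assumptions, not the benchmark's actual generator: the color vocabulary, the `Sample` fields, and the functions `generate_state_tracking_sample` and `score` are all hypothetical names, and exact-match trace comparison is only one possible way to check intermediate states.

```python
import random
from dataclasses import dataclass

# Hypothetical attribute vocabulary; the real benchmark's object states are not specified here.
COLORS = ["red", "green", "blue", "yellow"]


@dataclass
class Sample:
    initial_state: list  # color of each object before any transition
    events: list         # (step, object_id, new_color), one transition per step
    query_object: int    # object whose final state is asked about
    trace: list          # ground-truth color of the query object after each step
    answer: str          # ground-truth final color of the query object


def generate_state_tracking_sample(num_objects, horizon, seed):
    """Generate one toy sample; `horizon` (number of transitions) controls difficulty."""
    rng = random.Random(seed)
    state = [rng.choice(COLORS) for _ in range(num_objects)]
    initial_state = list(state)
    query_object = rng.randrange(num_objects)
    events, trace = [], []
    for step in range(horizon):
        obj = rng.randrange(num_objects)
        # Force an actual change so every step is a real state transition.
        new_color = rng.choice([c for c in COLORS if c != state[obj]])
        state[obj] = new_color
        events.append((step, obj, new_color))
        trace.append(state[query_object])
    return Sample(initial_state, events, query_object, trace, state[query_object])


def score(sample, predicted_answer, predicted_trace):
    """Report final-answer correctness and intermediate-trace correctness separately."""
    return {
        "final_correct": predicted_answer == sample.answer,
        "trace_correct": list(predicted_trace) == sample.trace,
    }


if __name__ == "__main__":
    sample = generate_state_tracking_sample(num_objects=3, horizon=5, seed=0)
    print(score(sample, sample.answer, sample.trace))
```

Reporting the two scores separately reflects the distinction drawn in the abstract: a model can reach the right final answer without recovering the intermediate states, and only the trace check exposes that case.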