VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos

Abstract

Mathematical reasoning in real-world video settings presents a fundamentally different challenge from reasoning over static images or text. It requires interpreting fine-grained visual information, accurately reading handwritten or digital text, and integrating spoken cues that are often dispersed non-linearly over time. In such multimodal contexts, success hinges not just on perception but on selectively identifying and integrating the right contextual details from a rich and noisy stream of content. To this end, we introduce VideoMathQA, a benchmark designed to evaluate whether models can perform such temporally extended cross-modal reasoning on videos. The benchmark spans 10 diverse mathematical domains and covers videos ranging from 10 seconds to over 1 hour. It requires models to interpret structured visual content, understand instructional narratives, and jointly ground concepts across visual, audio, and textual modalities. We employ graduate-level experts to ensure high quality, totaling over 920 person-hours of annotation. To reflect real-world scenarios, questions are designed around three core reasoning challenges: direct problem solving, where answers are grounded in the presented question; conceptual transfer, which requires applying learned methods to new problems; and deep instructional comprehension, which involves multi-step reasoning over extended explanations and partially worked-out solutions. Each question includes multi-step reasoning annotations, enabling fine-grained diagnosis of model capabilities. Through this benchmark, we highlight the limitations of existing approaches and establish a systematic evaluation framework for models that must reason, rather than merely perceive, across temporally extended and modality-rich mathematical problem settings. Our benchmark and evaluation code are available at: https://mbzuai-oryx.github.io/VideoMathQA
