VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos
June 5, 2025
作者: Hanoona Rasheed, Abdelrahman Shaker, Anqi Tang, Muhammad Maaz, Ming-Hsuan Yang, Salman Khan, Fahad Khan
cs.AI
Abstract
Mathematical reasoning in real-world video settings presents a fundamentally
different challenge from reasoning over static images or text. It requires interpreting
fine-grained visual information, accurately reading handwritten or digital
text, and integrating spoken cues, often dispersed non-linearly over time. In
such multimodal contexts, success hinges not just on perception, but on
selectively identifying and integrating the right contextual details from a
rich and noisy stream of content. To this end, we introduce VideoMathQA, a
benchmark designed to evaluate whether models can perform such temporally
extended cross-modal reasoning on videos. The benchmark spans 10 diverse
mathematical domains, covering videos ranging from 10 seconds to over 1 hour.
It requires models to interpret structured visual content, understand
instructional narratives, and jointly ground concepts across visual, audio, and
textual modalities. We employ graduate-level experts to produce high-quality
annotations, totaling over 920 hours of effort. To reflect real-world scenarios,
questions are designed around three core reasoning challenges: direct problem
solving, where answers are grounded in the presented question; conceptual
transfer, which requires applying learned methods to new problems; and deep
instructional comprehension, involving multi-step reasoning over extended
explanations and partially worked-out solutions. Each question includes
multi-step reasoning annotations, enabling fine-grained diagnosis of model
capabilities. Through this benchmark, we highlight the limitations of existing
approaches and establish a systematic evaluation framework for models that must
reason, rather than merely perceive, across temporally extended and
modality-rich mathematical problem settings. Our benchmark and evaluation code
are available at: https://mbzuai-oryx.github.io/VideoMathQA.
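
To make the "fine-grained diagnosis" enabled by the per-question multi-step reasoning annotations concrete, here is a minimal Python sketch of how such a record might be represented and scored at the step level. The field names (`video_id`, `domain`, `challenge`, `reasoning_steps`, etc.) and the exact-match scoring are our own illustrative assumptions, not the benchmark's released schema or evaluation code.

```python
# Minimal sketch: step-level diagnosis over multi-step reasoning annotations.
# All field names and the scoring scheme are illustrative assumptions, not the
# VideoMathQA release format.
from dataclasses import dataclass, field


@dataclass
class VideoMathQuestion:
    video_id: str
    domain: str                     # one of the 10 mathematical domains
    challenge: str                  # "direct", "transfer", or "comprehension"
    question: str
    options: list[str]
    answer: str                     # correct option label, e.g. "B"
    reasoning_steps: list[str] = field(default_factory=list)


def step_accuracy(gold_steps: list[str], predicted_steps: list[str]) -> float:
    """Fraction of annotated reasoning steps the model reproduces, in order.

    Greedy in-order exact matching is a crude proxy; a real evaluation would
    likely use a softer criterion (e.g. an LLM judge or token overlap).
    """
    matched, j = 0, 0
    for step in gold_steps:
        while j < len(predicted_steps):
            j += 1
            if predicted_steps[j - 1].strip().lower() == step.strip().lower():
                matched += 1
                break
    return matched / len(gold_steps) if gold_steps else 0.0


# Hypothetical usage: a "direct problem solving" question whose evidence is
# split between the whiteboard (visual) and the narration (audio).
q = VideoMathQuestion(
    video_id="demo_001", domain="geometry", challenge="direct",
    question="What is the area of the triangle drawn on the whiteboard?",
    options=["A) 6", "B) 12", "C) 24", "D) 36"], answer="B",
    reasoning_steps=[
        "read base = 6 from the board",
        "read height = 4 from the narration",
        "apply area = 0.5 * base * height = 12",
    ],
)
print(step_accuracy(q.reasoning_steps, q.reasoning_steps))  # 1.0
```

Scoring each annotated step separately, rather than only the final option, is what lets an evaluation distinguish perception failures (misreading the board or narration) from reasoning failures (applying the wrong method), which is the kind of diagnosis the abstract describes.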