VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos
June 5, 2025
Authors: Hanoona Rasheed, Abdelrahman Shaker, Anqi Tang, Muhammad Maaz, Ming-Hsuan Yang, Salman Khan, Fahad Khan
cs.AI
Abstract
Mathematical reasoning in real-world video settings presents a fundamentally
different challenge from that in static images or text. It requires interpreting
fine-grained visual information, accurately reading handwritten or digital
text, and integrating spoken cues, often dispersed non-linearly over time. In
such multimodal contexts, success hinges not just on perception, but on
selectively identifying and integrating the right contextual details from a
rich and noisy stream of content. To this end, we introduce VideoMathQA, a
benchmark designed to evaluate whether models can perform such temporally
extended cross-modal reasoning on videos. The benchmark spans 10 diverse
mathematical domains, covering videos ranging from 10 seconds to over 1 hour.
It requires models to interpret structured visual content, understand
instructional narratives, and jointly ground concepts across visual, audio, and
textual modalities. To ensure high quality, we employ graduate-level experts,
whose annotation work totals over 920 man-hours. To reflect real-world scenarios,
questions are designed around three core reasoning challenges: direct problem
solving, where answers are grounded in the presented question; conceptual
transfer, which requires applying learned methods to new problems; and deep
instructional comprehension, involving multi-step reasoning over extended
explanations and partially worked-out solutions. Each question includes
multi-step reasoning annotations, enabling fine-grained diagnosis of model
capabilities. Through this benchmark, we highlight the limitations of existing
approaches and establish a systematic evaluation framework for models that must
reason, rather than merely perceive, across temporally extended and
modality-rich mathematical problem settings. Our benchmark and evaluation code
are available at: https://mbzuai-oryx.github.io/VideoMathQA
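
As a concrete, purely illustrative picture of how a benchmark of this shape is
typically consumed, the short Python sketch below loads multiple-choice items
and computes overall accuracy. The JSON-lines layout and the field names
("video", "question", "options", "answer", "reasoning_steps") are assumptions,
not the released format; consult the evaluation code at the URL above for the
actual schema.

# Minimal sketch of loading and scoring VideoMathQA-style items.
# All field names below are hypothetical; the released format may differ.
import json

def load_items(path):
    """Read benchmark items from a JSON-lines file (assumed layout)."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def multiple_choice_accuracy(items, predict):
    """Score a model's option choices against gold answers.

    `predict` is any callable mapping (video_path, question, options) to
    an option label such as "B". The per-step annotations described in the
    abstract (assumed here under item["reasoning_steps"]) would additionally
    support a finer-grained, step-level diagnosis on top of this accuracy.
    """
    correct = 0
    for item in items:
        pred = predict(item["video"], item["question"], item["options"])
        correct += int(pred == item["answer"])
    return correct / len(items)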