VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos

Abstract

Mathematical reasoning in real-world video settings presents a fundamentally different challenge from reasoning over static images or text. It requires interpreting fine-grained visual information, accurately reading handwritten or digital text, and integrating spoken cues that are often dispersed non-linearly over time. In such multimodal contexts, success hinges not just on perception but on selectively identifying and integrating the right contextual details from a rich and noisy stream of content. To this end, we introduce VideoMathQA, a benchmark designed to evaluate whether models can perform such temporally extended cross-modal reasoning on videos. The benchmark spans 10 diverse mathematical domains and covers videos ranging from 10 seconds to over 1 hour. It requires models to interpret structured visual content, understand instructional narratives, and jointly ground concepts across visual, audio, and textual modalities. To ensure high quality, we employ graduate-level experts, whose annotation effort totals over 920 man-hours. To reflect real-world scenarios, questions are designed around three core reasoning challenges: direct problem solving, where answers are grounded in the presented question; conceptual transfer, which requires applying learned methods to new problems; and deep instructional comprehension, which involves multi-step reasoning over extended explanations and partially worked-out solutions. Each question includes multi-step reasoning annotations, enabling fine-grained diagnosis of model capabilities. Through this benchmark, we highlight the limitations of existing approaches and establish a systematic evaluation framework for models that must reason, rather than merely perceive, across temporally extended and modality-rich mathematical problem settings. Our benchmark and evaluation code are available at: https://mbzuai-oryx.github.io/VideoMathQA
