
BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset

July 4, 2025
Authors: Zhiheng Xi, Guanyu Li, Yutao Fan, Honglin Guo, Yufang Liu, Xiaoran Fan, Jiaqi Liu, Jingchao Ding, Wangmeng Zuo, Zhenfei Yin, Lei Bai, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang
cs.AI

Abstract
In this paper, we introduce BMMR, a large-scale bilingual, multimodal, multi-disciplinary reasoning dataset for the community to develop and evaluate large multimodal models (LMMs). BMMR comprises 110k college-level questions spanning 300 UNESCO-defined subjects in diverse formats (multiple-choice, fill-in-the-blank, and open-ended QA), sourced from both print and digital media such as books, exams, and quizzes. All data are curated and filtered via a scalable human-in-the-loop framework, and each instance is paired with a high-quality reasoning path. The dataset is organized into two parts: BMMR-Eval, which comprises 20,458 high-quality instances to comprehensively assess LMMs' knowledge and reasoning across multiple disciplines in both Chinese and English; and BMMR-Train, which contains 88,991 instances to support further research and development, extending the current focus on mathematical reasoning to diverse disciplines and domains. In addition, we propose a process-based multi-discipline verifier (BMMR-Verifier) for accurate and fine-grained evaluation of reasoning paths. Extensive experiments on 24 models reveal that (i) even SOTA models (e.g., o3 and Gemini-2.5-Pro) leave substantial headroom on BMMR-Eval; (ii) reasoning models exhibit discipline bias and outperform general LMMs only on specific subjects; (iii) open-source models still trail their proprietary counterparts; and (iv) fine-tuning on BMMR-Train narrows this gap. Additionally, we conduct reasoning-chain analyses with BMMR-Verifier and other in-depth studies, uncovering the challenges LMMs currently face in multidisciplinary reasoning. We will release the data, and we hope our work offers insights and contributions to the community.
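To make the per-discipline evaluation concrete, the sketch below shows the kind of breakdown needed to observe the discipline bias the abstract reports. This is a hypothetical illustration: the abstract does not specify BMMR's data schema, so the field names (`discipline`, `answer`, etc.) and the toy instances are assumptions, not the released format.

```python
# Hypothetical sketch: field names below are illustrative assumptions,
# not the actual BMMR release format.
from collections import defaultdict

# Toy stand-ins for BMMR-Eval instances and model predictions.
instances = [
    {"id": 1, "discipline": "Mathematics", "language": "en", "answer": "B"},
    {"id": 2, "discipline": "Mathematics", "language": "zh", "answer": "42"},
    {"id": 3, "discipline": "History", "language": "en", "answer": "C"},
]
predictions = {1: "B", 2: "42", 3: "A"}

def per_discipline_accuracy(instances, predictions):
    """Exact-match accuracy grouped by discipline: the breakdown needed
    to see whether a model outperforms others only on specific subjects."""
    correct, total = defaultdict(int), defaultdict(int)
    for inst in instances:
        d = inst["discipline"]
        total[d] += 1
        if predictions.get(inst["id"]) == inst["answer"]:
            correct[d] += 1
    return {d: correct[d] / total[d] for d in total}

print(per_discipline_accuracy(instances, predictions))
# → {'Mathematics': 1.0, 'History': 0.0}
```

Aggregating accuracy per discipline rather than overall is what exposes a discipline-biased model: its macro average can look strong while entire subject areas lag behind.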