BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset
July 4, 2025
Authors: Zhiheng Xi, Guanyu Li, Yutao Fan, Honglin Guo, Yufang Liu, Xiaoran Fan, Jiaqi Liu, Jingchao Ding, Wangmeng Zuo, Zhenfei Yin, Lei Bai, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang
cs.AI
Abstract
In this paper, we introduce BMMR, a large-scale bilingual, multimodal,
multi-disciplinary reasoning dataset for the community to develop and evaluate
large multimodal models (LMMs). BMMR comprises 110k college-level questions
spanning 300 UNESCO-defined subjects in diverse formats (multiple-choice,
fill-in-the-blank, and open-ended QA), sourced from both print and digital
media such as books, exams, and quizzes. All data are curated and filtered via
a human-in-the-loop and scalable framework, and each instance is paired with a
high-quality reasoning path. The dataset is organized into two parts: BMMR-Eval,
which comprises 20,458 high-quality instances to comprehensively assess LMMs'
knowledge and reasoning across multiple disciplines in both Chinese and
English; and BMMR-Train, which contains 88,991 instances to support further
research and development, extending the current focus on mathematical reasoning
to diverse disciplines and domains. In addition, we propose the process-based
multi-discipline verifier (i.e., BMMR-Verifier) for accurate and fine-grained
evaluation of reasoning paths. Extensive experiments on 24 models reveal that
(i) even SOTA models (e.g., o3 and Gemini-2.5-Pro) leave substantial headroom
on BMMR-Eval; (ii) reasoning models exhibit discipline bias and outperform LMMs
only on specific subjects; (iii) open-source models still trail their
proprietary counterparts; and (iv) fine-tuning on BMMR-Train narrows this gap.
Additionally, we conduct reasoning-chain analyses using BMMR-Verifier and other
in-depth studies, uncovering the challenges LMMs currently face in
multidisciplinary reasoning. We will release the data, and we hope our work
offers insights and contributions to the community.
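To make the idea of process-based (per-step) evaluation concrete, here is a purely illustrative sketch: a step scorer rates each step of a reasoning path and an aggregator combines the scores so that one wrong step sinks the whole path. The stub scorer and the minimum-based aggregation below are hypothetical placeholders for demonstration; BMMR-Verifier's actual model and scoring procedure are described in the paper, not reproduced here.

```python
# Illustrative sketch of process-based reasoning-path scoring.
# score_step is a stub (a real verifier would be a trained model);
# it is NOT the BMMR-Verifier, only a toy stand-in.

def score_step(step: str) -> float:
    """Hypothetical per-step scorer returning a correctness score in [0, 1].
    This stub simply flags steps containing an explicit error marker."""
    return 0.0 if "ERROR" in step else 1.0

def score_path(steps: list[str]) -> float:
    """Aggregate per-step scores; taking the minimum means a single
    incorrect step lowers the whole path's score, which is the point
    of fine-grained process-based (rather than outcome-only) evaluation."""
    if not steps:
        return 0.0
    return min(score_step(s) for s in steps)

path_ok = ["Compute the area of the base.", "Multiply by the height."]
path_bad = ["Compute the area of the base.", "ERROR: divide by zero."]
print(score_path(path_ok))   # 1.0
print(score_path(path_bad))  # 0.0
```

An outcome-based verifier would instead check only the final answer; the per-step formulation above is what lets a verifier localize where a reasoning chain goes wrong.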