BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset
July 4, 2025
Authors: Zhiheng Xi, Guanyu Li, Yutao Fan, Honglin Guo, Yufang Liu, Xiaoran Fan, Jiaqi Liu, Jingchao Ding, Wangmeng Zuo, Zhenfei Yin, Lei Bai, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang
cs.AI
Abstract
In this paper, we introduce BMMR, a large-scale bilingual, multimodal,
multi-disciplinary reasoning dataset for the community to develop and evaluate
large multimodal models (LMMs). BMMR comprises 110k college-level questions
spanning 300 UNESCO-defined subjects in diverse formats (multiple-choice,
fill-in-the-blank, and open-ended QA), sourced from both print and digital
media such as books, exams, and quizzes. All data are curated and filtered via
a human-in-the-loop and scalable framework, and each instance is paired with a
high-quality reasoning path. The dataset is organized into two parts: BMMR-Eval,
which comprises 20,458 high-quality instances to comprehensively assess LMMs'
knowledge and reasoning across multiple disciplines in both Chinese and
English; and BMMR-Train, which contains 88,991 instances to support further
research and development, extending the current focus on mathematical reasoning
to diverse disciplines and domains. In addition, we propose the process-based
multi-discipline verifier (i.e., BMMR-Verifier) for accurate and fine-grained
evaluation of reasoning paths. Extensive experiments on 24 models reveal that
(i) even SOTA models (e.g., o3 and Gemini-2.5-Pro) leave substantial headroom
on BMMR-Eval; (ii) reasoning models exhibit discipline bias and outperform LMMs
only on specific subjects; (iii) open-source models still trail their
proprietary counterparts; and (iv) fine-tuning on BMMR-Train narrows this gap.
Additionally, we conduct reasoning-chain analyses using BMMR-Verifier and other
in-depth studies, uncovering the challenges LMMs currently face in
multidisciplinary reasoning. We will release the data, and we hope our work
offers insights and contributions to the community.
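To make the idea of process-based (per-step) evaluation concrete, here is a purely illustrative sketch: a step scorer rates each step of a reasoning path and an aggregator combines the scores so that one wrong step sinks the whole path. The stub scorer and the minimum-based aggregation below are hypothetical placeholders for demonstration; BMMR-Verifier's actual model and scoring procedure are described in the paper, not reproduced here.

```python
# Illustrative sketch of process-based reasoning-path scoring.
# score_step is a stub (a real verifier would be a trained model);
# it is NOT the BMMR-Verifier, only a toy stand-in.

def score_step(step: str) -> float:
    """Hypothetical per-step scorer returning a correctness score in [0, 1].
    This stub simply flags steps containing an explicit error marker."""
    return 0.0 if "ERROR" in step else 1.0

def score_path(steps: list[str]) -> float:
    """Aggregate per-step scores; taking the minimum means a single
    incorrect step lowers the whole path's score, which is the point
    of fine-grained process-based (rather than outcome-only) evaluation."""
    if not steps:
        return 0.0
    return min(score_step(s) for s in steps)

path_ok = ["Compute the area of the base.", "Multiply by the height."]
path_bad = ["Compute the area of the base.", "ERROR: divide by zero."]
print(score_path(path_ok))   # 1.0
print(score_path(path_bad))  # 0.0
```

An outcome-based verifier would instead check only the final answer; the per-step formulation above is what lets a verifier localize where a reasoning chain goes wrong.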