BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset

Abstract

In this paper, we introduce BMMR, a large-scale bilingual, multimodal, multi-disciplinary reasoning dataset for the community to develop and evaluate large multimodal models (LMMs). BMMR comprises 110k college-level questions spanning 300 UNESCO-defined subjects in diverse formats (multiple-choice, fill-in-the-blank, and open-ended QA), sourced from both print and digital media such as books, exams, and quizzes. All data are curated and filtered via a scalable human-in-the-loop framework, and each instance is paired with a high-quality reasoning path. The dataset is organized into two parts: BMMR-Eval, which comprises 20,458 high-quality instances for comprehensively assessing LMMs' knowledge and reasoning across multiple disciplines in both Chinese and English; and BMMR-Train, which contains 88,991 instances to support further research and development, extending the current focus on mathematical reasoning to diverse disciplines and domains. In addition, we propose a process-based multi-discipline verifier, BMMR-Verifier, for accurate and fine-grained evaluation of reasoning paths. Extensive experiments on 24 models reveal that (i) even SOTA models (e.g., o3 and Gemini-2.5-Pro) leave substantial headroom on BMMR-Eval; (ii) reasoning models exhibit discipline bias and outperform LMMs only on specific subjects; (iii) open-source models still trail their proprietary counterparts; and (iv) fine-tuning on BMMR-Train narrows this gap. Additionally, we conduct reasoning-chain analyses using BMMR-Verifier and other in-depth studies, uncovering the challenges LMMs currently face in multidisciplinary reasoning. We will release the data, and we hope our work can offer insights and contributions to the community.
