MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning

Abstract

Scientific reasoning is critical for developing AI scientists and supporting human researchers in advancing the frontiers of natural science discovery. However, the open-source community has primarily focused on mathematics and coding while neglecting the scientific domain, largely due to the absence of open, large-scale, high-quality, verifiable scientific reasoning datasets. To bridge this gap, we first present TextbookReasoning, an open dataset featuring truthful reference answers extracted from 12k university-level scientific textbooks, comprising 650k reasoning questions spanning 7 scientific disciplines. We further introduce MegaScience, a large-scale mixture of high-quality open-source datasets totaling 1.25 million instances, developed through systematic ablation studies that evaluate various data selection methodologies to identify the optimal subset for each publicly available scientific dataset. Meanwhile, we build a comprehensive evaluation system covering diverse subjects and question types across 15 benchmarks, incorporating comprehensive answer extraction strategies to ensure accurate evaluation metrics. Our experiments demonstrate that our datasets achieve superior performance and training efficiency with more concise response lengths compared to existing open-source scientific datasets. Furthermore, we train Llama3.1, Qwen2.5, and Qwen3 series base models on MegaScience, which significantly outperform the corresponding official instruct models in average performance. In addition, MegaScience exhibits greater effectiveness for larger and stronger models, suggesting a scaling benefit for scientific tuning. We release our data curation pipeline, evaluation system, datasets, and seven trained models to the community to advance scientific reasoning research.
