Gazal-R1: Achieving State-of-the-Art Medical Reasoning with Parameter-Efficient Two-Stage Training

Abstract

We present Gazal-R1, a 32-billion-parameter language model that achieves state-of-the-art performance in medical reasoning while providing transparent, step-by-step explanations for clinical decision-making. Built on Qwen3 32B, the model demonstrates that strategic training can enable mid-sized models to outperform significantly larger counterparts in specialized domains. We developed a novel two-stage training pipeline. The first stage is supervised fine-tuning on a carefully curated dataset of 107,033 synthetic medical reasoning examples that teaches structured clinical thinking, enhanced by advanced parameter-efficient techniques including Weight-Decomposed Low-Rank Adaptation (DoRA) and Rank-Stabilized LoRA (rsLoRA). The second stage is reinforcement learning using Group Relative Policy Optimization (GRPO) with a multi-component reward system that refines accuracy, format adherence, and reasoning quality. Gazal-R1 scores 87.1% on MedQA, 81.6% on MMLU Pro (Medical), and 79.6% on PubMedQA, surpassing models up to 12x larger. Beyond these empirical results, this work provides detailed insights into the challenges of training reasoning-capable models in specialized domains, including reward hacking, training instability, and the fundamental tension between factual recall and detailed reasoning. Our methodology offers a reproducible framework for developing high-capability, domain-specific language models that balance performance, efficiency, and explainability.
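As a rough illustration of the second stage, the sketch below shows how GRPO scores each sampled completion relative to the other completions drawn for the same prompt, using a reward built from several components. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the three component scores, the weights, and the use of the population standard deviation are all hypothetical choices made for this example.

```python
from statistics import mean, pstdev


def combined_reward(accuracy, format_ok, reasoning_quality,
                    weights=(1.0, 0.5, 0.5)):
    """Weighted sum of three reward components.

    The component names mirror the abstract (accuracy, format adherence,
    reasoning quality); the weights are hypothetical.
    """
    w_acc, w_fmt, w_rsn = weights
    return w_acc * accuracy + w_fmt * format_ok + w_rsn * reasoning_quality


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for a single prompt.

    Each advantage is (reward - group mean) / (group std + eps), so no
    separate learned value function is needed to form a baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical completions sampled for the same medical question.
group = [
    combined_reward(1.0, 1.0, 0.8),  # correct, well formatted
    combined_reward(0.0, 1.0, 0.4),  # wrong answer, well formatted
    combined_reward(1.0, 0.0, 0.6),  # correct, format violated
    combined_reward(0.0, 0.0, 0.2),  # wrong answer, format violated
]
advantages = group_relative_advantages(group)
```

Completions that score above the group mean receive positive advantages and are reinforced; those below the mean are discouraged.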

