Gazal-R1: Achieving State-of-the-Art Medical Reasoning with Parameter-Efficient Two-Stage Training
June 18, 2025
Authors: Ahmed M. Adly, Mostafa Samy, Amr Fawzy
cs.AI
Abstract
We present Gazal-R1, a 32-billion-parameter language model that achieves
state-of-the-art performance in medical reasoning while providing transparent,
step-by-step explanations for clinical decision-making. Built upon Qwen3 32B,
our model demonstrates that strategic training can enable mid-sized models to
outperform significantly larger counterparts in specialized domains. We
developed a novel two-stage training pipeline: first, supervised fine-tuning on
a carefully curated dataset of 107,033 synthetic medical reasoning examples
that teaches structured clinical thinking, enhanced by advanced
parameter-efficient techniques including Weight-Decomposed Low-Rank Adaptation
(DoRA) and Rank-Stabilized LoRA (rsLoRA); second, reinforcement learning using
Group Relative Policy Optimization (GRPO) with a sophisticated multi-component
reward system that refines accuracy, format adherence, and reasoning quality.
Gazal-R1 achieves exceptional performance across medical benchmarks, scoring
87.1% on MedQA, 81.6% on MMLU Pro (Medical), and 79.6% on PubMedQA, surpassing
models up to 12x larger. Beyond its strong empirical results, this work
provides detailed insights into the challenges of training reasoning-capable
models in specialized domains, including issues with reward hacking, training
instability, and the fundamental tension between factual recall and detailed
reasoning. Our methodology offers a reproducible framework for developing
high-capability, domain-specific language models that balance performance,
efficiency, and explainability.
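
As a rough illustration of the parameter-efficient first stage described in the abstract, the sketch below combines Weight-Decomposed Low-Rank Adaptation (DoRA) and rank-stabilized LoRA scaling (rsLoRA) through the Hugging Face `peft` library. The rank, alpha, dropout, and target modules are illustrative assumptions, not the paper's reported hyperparameters.

```python
# Minimal sketch: supervised fine-tuning setup with DoRA + rsLoRA adapters (Hugging Face peft).
# All hyperparameters below are illustrative assumptions, not Gazal-R1's reported settings.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-32B")

peft_config = LoraConfig(
    r=64,                           # assumed adapter rank
    lora_alpha=128,                 # assumed scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    lora_dropout=0.05,
    use_dora=True,                  # Weight-Decomposed Low-Rank Adaptation
    use_rslora=True,                # rank-stabilized scaling: alpha / sqrt(r) instead of alpha / r
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, peft_config)
model.print_trainable_parameters()  # confirms only the low-rank adapter weights are trainable
```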
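The second stage's multi-component reward can likewise be sketched as a set of reward callables passed to a GRPO trainer, here using the `trl` GRPOTrainer interface. The three reward functions are hypothetical placeholders for the accuracy, format-adherence, and reasoning-quality signals named in the abstract; the dataset, `answer` column, reward magnitudes, and `<think>` tag format are assumptions rather than the paper's exact design.

```python
# Minimal sketch: multi-component GRPO rewards (accuracy, format, reasoning quality)
# with trl's GRPOTrainer. Reward logic and data are hypothetical placeholders.
import re
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Tiny placeholder dataset; the "answer" column is passed to reward functions as a kwarg.
train_dataset = Dataset.from_dict({
    "prompt": ["A 54-year-old presents with crushing substernal chest pain. Most likely diagnosis?"],
    "answer": ["myocardial infarction"],
})

def accuracy_reward(completions, answer, **kwargs):
    # 1.0 if the reference answer appears in the completion, else 0.0 (assumes string completions).
    return [1.0 if ref.lower() in c.lower() else 0.0 for c, ref in zip(completions, answer)]

def format_reward(completions, **kwargs):
    # Partial reward for wrapping the reasoning trace in the expected tags.
    return [0.5 if re.search(r"<think>.*?</think>", c, re.DOTALL) else 0.0 for c in completions]

def reasoning_quality_reward(completions, **kwargs):
    # Crude proxy: prefer multi-step reasoning over one-liners, with a capped length bonus.
    return [min(len(c.split()) / 400.0, 0.5) for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen3-32B",          # in practice, the SFT checkpoint from stage one
    reward_funcs=[accuracy_reward, format_reward, reasoning_quality_reward],
    args=GRPOConfig(output_dir="gazal-r1-grpo"),
    train_dataset=train_dataset,
)
trainer.train()
```

Summing several narrow reward signals like this, rather than relying on a single scalar, is one common way to discourage the reward hacking the abstract mentions, since a policy must satisfy format and reasoning constraints as well as final-answer accuracy.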