Gazal-R1: Achieving State-of-the-Art Medical Reasoning with Parameter-Efficient Two-Stage Training

June 18, 2025
Authors: Ahmed M. Adly, Mostafa Samy, Amr Fawzy
cs.AI

Abstract

We present Gazal-R1, a 32-billion-parameter language model that achieves state-of-the-art performance in medical reasoning while providing transparent, step-by-step explanations for clinical decision-making. Built upon Qwen3 32B, our model demonstrates that strategic training can enable mid-sized models to outperform significantly larger counterparts in specialized domains. We developed a novel two-stage training pipeline: first, supervised fine-tuning on a carefully curated dataset of 107,033 synthetic medical reasoning examples that teaches structured clinical thinking, enhanced by advanced parameter-efficient techniques including Weight-Decomposed Low-Rank Adaptation (DoRA) and Rank-Stabilized LoRA (rsLoRA); second, reinforcement learning using Group Relative Policy Optimization (GRPO) with a sophisticated multi-component reward system that refines accuracy, format adherence, and reasoning quality. Gazal-R1 achieves exceptional performance across medical benchmarks, scoring 87.1% on MedQA, 81.6% on MMLU Pro (Medical), and 79.6% on PubMedQA, surpassing models up to 12x larger. Beyond its strong empirical results, this work provides detailed insights into the challenges of training reasoning-capable models in specialized domains, including issues with reward hacking, training instability, and the fundamental tension between factual recall and detailed reasoning. Our methodology offers a reproducible framework for developing high-capability, domain-specific language models that balance performance, efficiency, and explainability.
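The first training stage combines supervised fine-tuning with DoRA and rsLoRA. As a minimal sketch of how such a setup is typically wired together, assuming the Hugging Face PEFT library (which exposes both techniques through `use_dora` and `use_rslora`), the snippet below is illustrative only; the rank, scaling, dropout, and target modules are placeholder values, not the authors' reported configuration.

```python
# Illustrative parameter-efficient SFT setup (DoRA + rsLoRA), assuming
# Hugging Face Transformers + PEFT; hyperparameters are placeholders.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-32B")

peft_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=64,                       # adapter rank (illustrative value)
    lora_alpha=128,             # with rsLoRA, effective scale is alpha / sqrt(r)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_dora=True,              # Weight-Decomposed Low-Rank Adaptation (DoRA)
    use_rslora=True,            # Rank-Stabilized LoRA scaling (rsLoRA)
)

model = get_peft_model(base, peft_config)
model.print_trainable_parameters()  # confirms only adapter weights are trainable
```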
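The second stage uses GRPO with a multi-component reward balancing accuracy, format adherence, and reasoning quality. The sketch below shows how separate reward functions can be composed, assuming TRL's `GRPOTrainer`; the regex-based format check, the final-answer extraction, and the `answer` dataset column are hypothetical stand-ins, not the paper's actual reward system.

```python
# Minimal sketch of a multi-component GRPO reward, assuming TRL's GRPOTrainer;
# the checks below are illustrative, not the authors' reward implementation.
import re
from trl import GRPOConfig, GRPOTrainer

def format_reward(completions, **kwargs):
    # Reward completions that keep their reasoning inside <think>...</think>
    # and then emit a final answer (illustrative format criterion).
    pattern = re.compile(r"<think>.*?</think>\s*\S+", re.DOTALL)
    return [1.0 if pattern.search(c) else 0.0 for c in completions]

def accuracy_reward(completions, answer, **kwargs):
    # Compare the final multiple-choice letter against the gold label;
    # `answer` is assumed to be a column of the training dataset.
    rewards = []
    for completion, gold in zip(completions, answer):
        match = re.search(r"\b([A-E])\b\s*$", completion.strip())
        rewards.append(1.0 if match and match.group(1) == gold else 0.0)
    return rewards

trainer = GRPOTrainer(
    model="path/to/stage-one-sft-checkpoint",   # hypothetical checkpoint path
    reward_funcs=[format_reward, accuracy_reward],
    args=GRPOConfig(output_dir="grpo-out", num_generations=8),
    train_dataset=train_dataset,                # assumed columns: "prompt", "answer"
)
trainer.train()
```

Keeping each criterion in its own reward function makes it easier to monitor and reweight individual signals, which is one common way to detect reward hacking of the kind the abstract mentions.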