Gazal-R1: Achieving State-of-the-Art Medical Reasoning with Parameter-Efficient Two-Stage Training

Abstract

We present Gazal-R1, a 32-billion-parameter language model that achieves state-of-the-art performance in medical reasoning while providing transparent, step-by-step explanations for clinical decision-making. Built upon Qwen3 32B, our model demonstrates that strategic training can enable mid-sized models to outperform significantly larger counterparts in specialized domains. We developed a novel two-stage training pipeline: first, supervised fine-tuning on a carefully curated dataset of 107,033 synthetic medical reasoning examples that teaches structured clinical thinking, enhanced by advanced parameter-efficient techniques including Weight-Decomposed Low-Rank Adaptation (DoRA) and Rank-Stabilized LoRA (rsLoRA); second, reinforcement learning using Group Relative Policy Optimization (GRPO) with a sophisticated multi-component reward system that refines accuracy, format adherence, and reasoning quality. Gazal-R1 achieves exceptional performance across medical benchmarks, scoring 87.1% on MedQA, 81.6% on MMLU Pro (Medical), and 79.6% on PubMedQA, surpassing models up to 12x larger. Beyond its strong empirical results, this work provides detailed insights into the challenges of training reasoning-capable models in specialized domains, including issues with reward hacking, training instability, and the fundamental tension between factual recall and detailed reasoning. Our methodology offers a reproducible framework for developing high-capability, domain-specific language models that balance performance, efficiency, and explainability.
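The second training stage combines a multi-component reward with GRPO, which scores each sampled completion relative to the other completions drawn for the same prompt. The following is a minimal sketch of that idea, not the authors' implementation: the component names follow the abstract (accuracy, format adherence, reasoning quality), but the weights, the example scores, and the helper names `combined_reward` and `group_relative_advantages` are hypothetical, and the sketch uses the population standard deviation where real implementations may differ.

```python
from statistics import mean, pstdev

# Hypothetical weights for the three reward components named in the abstract.
# The values actually used for Gazal-R1 are not stated here.
WEIGHTS = {"accuracy": 0.6, "format": 0.2, "reasoning": 0.2}


def combined_reward(scores):
    """Weighted sum of per-component scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), which is the
    group-relative baseline that GRPO uses in place of a learned value model.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four made-up completions sampled for a single prompt.
group = [
    {"accuracy": 1.0, "format": 1.0, "reasoning": 0.5},
    {"accuracy": 0.0, "format": 1.0, "reasoning": 0.5},
    {"accuracy": 1.0, "format": 0.0, "reasoning": 1.0},
    {"accuracy": 0.0, "format": 0.0, "reasoning": 0.0},
]
rewards = [combined_reward(s) for s in group]
advantages = group_relative_advantages(rewards)
```

Completions that score above the group mean receive positive advantages and are reinforced, while those below the mean are penalized. Because the baseline is computed per group, a reward component that is easy to maximize (for example, format alone) can dominate if weighted carelessly, which is one way the reward hacking mentioned in the abstract can arise.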
