Baichuan-M2: Scaling Medical Capability with Large Verifier System

Abstract

As large language models (LLMs) advance in conversational and reasoning capabilities, their practical application in healthcare has become a critical research focus. However, there is a notable gap between the performance of medical LLMs on static benchmarks such as USMLE and their utility in real-world clinical decision-making. This discrepancy arises because traditional exams fail to capture the dynamic, interactive nature of medical consultations. To address this challenge, we introduce a novel dynamic verification framework that moves beyond static answer verification, establishing a large-scale, high-fidelity interactive reinforcement learning system. Our framework comprises two key components: a Patient Simulator that creates realistic clinical environments using de-identified medical records, and a Clinical Rubrics Generator that dynamically produces multi-dimensional evaluation metrics. Building on this foundation, we develop Baichuan-M2, a 32B-parameter medical augmented reasoning model trained through a multi-stage reinforcement learning strategy with an improved Group Relative Policy Optimization (GRPO) algorithm. Evaluated on HealthBench, Baichuan-M2 outperforms all other open-source models and most advanced closed-source counterparts, achieving a score above 32 on the challenging HealthBench Hard benchmark, a level previously exceeded only by GPT-5. Our work demonstrates that a robust dynamic verifier system is essential for aligning LLM capabilities with practical clinical applications, establishing a new Pareto front in the performance-parameter trade-off for medical AI deployment.
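To make the training signal concrete, the following is a minimal sketch of how rubric-based rewards could feed a group-relative advantage, assuming the standard GRPO formulation (each reward normalized by the mean and standard deviation of its sampling group). The abstract does not describe the improved GRPO variant, so this is not the authors' method; the function names, the rubric weights, and the weighted-fraction scoring rule are all hypothetical choices made for illustration.

```python
from statistics import mean, pstdev


def rubric_score(met, weights):
    """Weighted fraction of rubric criteria satisfied, in [0, 1].

    Hypothetical scoring rule: the abstract does not specify how
    rubric items are aggregated into a reward.
    """
    total = sum(weights)
    return sum(w for m, w in zip(met, weights) if m) / total


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage within one group of samples:
    (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to the same simulated consultation, each
# checked against three rubric criteria (illustrative values only).
weights = [3.0, 2.0, 1.0]
checks = [
    [True, True, True],
    [True, False, True],
    [False, True, False],
    [False, False, False],
]
rewards = [rubric_score(c, weights) for c in checks]
advantages = group_relative_advantages(rewards)
```

In this sketch, responses that satisfy more of the rubric than their group average receive a positive advantage and the rest a negative one, so the policy is compared against its own samples rather than a learned value baseline.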
