Baichuan-M2: Scaling Medical Capability with Large Verifier System

Abstract

As large language models (LLMs) advance in conversational and reasoning capabilities, their practical application in healthcare has become a critical research focus. However, there is a notable gap between the performance of medical LLMs on static benchmarks such as USMLE and their utility in real-world clinical decision-making. This discrepancy arises because traditional exams fail to capture the dynamic, interactive nature of medical consultations. To address this challenge, we introduce a novel dynamic verification framework that moves beyond static answer verifiers, establishing a large-scale, high-fidelity interactive reinforcement learning system. Our framework comprises two key components: a Patient Simulator that creates realistic clinical environments using de-identified medical records, and a Clinical Rubrics Generator that dynamically produces multi-dimensional evaluation metrics. Building on this foundation, we develop Baichuan-M2, a 32B-parameter medical augmented reasoning model trained through a multi-stage reinforcement learning strategy with an improved Group Relative Policy Optimization (GRPO) algorithm. Evaluated on HealthBench, Baichuan-M2 outperforms all other open-source models and most advanced closed-source counterparts, achieving a score above 32 on the challenging HealthBench Hard benchmark, a threshold previously exceeded only by GPT-5. Our work demonstrates that a robust dynamic verifier system is essential for aligning LLM capabilities with practical clinical applications, establishing a new Pareto front in the performance-parameter trade-off for medical AI deployment.
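
For readers unfamiliar with GRPO, the sketch below illustrates the group-relative advantage used in the standard (baseline) formulation: several responses are sampled for the same prompt, each receives a scalar reward, and each reward is standardized against the group's mean and standard deviation. It is not the improved variant mentioned above, whose details the abstract does not give. The function name, the `eps` stabilizer, and the assumption that rewards are scalar rubric scores in [0, 1] are illustrative choices, not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of responses to the same prompt.

    Baseline GRPO-style advantage: A_i = (r_i - mean(r)) / (std(r) + eps).
    `eps` is an assumed stabilizer that avoids division by zero when every
    response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rubric scores for four sampled responses to one prompt.
    scores = [0.9, 0.4, 0.6, 0.1]
    for score, adv in zip(scores, group_relative_advantages(scores)):
        print(f"reward={score:.2f}  advantage={adv:+.3f}")
```

Responses scoring above the group mean receive positive advantages and those below receive negative ones, so no separate learned value model is needed to form the baseline.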