Baichuan-M2: Scaling Medical Capability with Large Verifier System

Abstract

As large language models (LLMs) advance in conversational and reasoning capabilities, their practical application in healthcare has become a critical research focus. However, there is a notable gap between the performance of medical LLMs on static benchmarks such as USMLE and their utility in real-world clinical decision-making. This discrepancy arises because traditional exams fail to capture the dynamic, interactive nature of medical consultations. To address this challenge, we introduce a novel dynamic verification framework that moves beyond static answer verifiers, establishing a large-scale, high-fidelity interactive reinforcement learning system. Our framework comprises two key components: a Patient Simulator that creates realistic clinical environments using de-identified medical records, and a Clinical Rubrics Generator that dynamically produces multi-dimensional evaluation metrics. Building on this foundation, we develop Baichuan-M2, a 32B-parameter medical augmented reasoning model trained through a multi-stage reinforcement learning strategy with an improved Group Relative Policy Optimization (GRPO) algorithm. Evaluated on HealthBench, Baichuan-M2 outperforms all other open-source models and most advanced closed-source counterparts, achieving a score above 32 on the challenging HealthBench Hard benchmark, a level previously exceeded only by GPT-5. Our work demonstrates that a robust dynamic verifier system is essential for aligning LLM capabilities with practical clinical applications, establishing a new Pareto front in the performance-parameter trade-off for medical AI deployment.
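To make the training signal more concrete, the following is a minimal sketch of how rubric-based scores might feed a group-relative advantage, assuming the standard GRPO normalization (reward minus group mean, divided by group standard deviation). It is not the improved variant used for Baichuan-M2, whose details the abstract does not give. The function names `rubric_score` and `group_relative_advantages`, the non-negative criterion weights, and the `eps` constant are all hypothetical choices made for illustration.

```python
from statistics import mean, pstdev


def rubric_score(met, weights):
    """Weighted fraction of rubric criteria satisfied, in [0, 1].

    `met` is a list of booleans (criterion satisfied or not) and `weights`
    the matching non-negative importance weights. Both are illustrative
    assumptions, not the paper's rubric format.
    """
    total = sum(weights)
    earned = sum(w for m, w in zip(met, weights) if m)
    return earned / total


def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRPO-style advantages for one group of sampled responses.

    Each response's reward is centered on the group mean and scaled by the
    group (population) standard deviation; `eps` avoids division by zero
    when every response in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: three sampled responses to the same simulated consultation,
# each judged against the same three weighted rubric criteria.
weights = [2.0, 1.0, 1.0]
rewards = [
    rubric_score([True, True, True], weights),
    rubric_score([True, False, True], weights),
    rubric_score([False, False, True], weights),
]
advantages = group_relative_advantages(rewards)
```

In this sketch, responses that satisfy more of the rubric than their group average receive positive advantages and the rest receive negative ones, so no separate value model is needed to form a baseline.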
