Baichuan-M2: Scaling Medical Capability with Large Verifier System
September 2, 2025
Authors: Baichuan-M2 Team, Chengfeng Dou, Chong Liu, Fan Yang, Fei Li, Jiyuan Jia, Mingyang Chen, Qiang Ju, Shuai Wang, Shunya Dang, Tianpeng Li, Xiangrong Zeng, Yijie Zhou, Chenzheng Zhu, Da Pan, Fei Deng, Guangwei Ai, Guosheng Dong, Hongda Zhang, Jinyang Tai, Jixiang Hong, Kai Lu, Linzhuang Sun, Peidong Guo, Qian Ma, Rihui Xin, Shihui Yang, Shusen Zhang, Yichuan Mo, Zheng Liang, Zhishou Zhang, Hengfu Cui, Zuyi Zhu, Xiaochuan Wang
cs.AI
Abstract
As large language models (LLMs) advance in conversational and reasoning
capabilities, their practical application in healthcare has become a critical
research focus. However, there is a notable gap between the performance of
medical LLMs on static benchmarks such as USMLE and their utility in real-world
clinical decision-making. This discrepancy arises because traditional exams
fail to capture the dynamic, interactive nature of medical consultations. To
address this challenge, we introduce a novel dynamic verification framework
that moves beyond static answer verifiers, establishing a large-scale,
high-fidelity interactive reinforcement learning system. Our framework
comprises two key components: a Patient Simulator that creates realistic
clinical environments using de-identified medical records, and a Clinical
Rubrics Generator that dynamically produces multi-dimensional evaluation
metrics. Building on this foundation, we develop Baichuan-M2, a 32B-parameter
medical augmented reasoning model trained through a multi-stage reinforcement
learning strategy with an improved Group Relative Policy Optimization (GRPO)
algorithm. Evaluated on HealthBench, Baichuan-M2 outperforms all other
open-source models and most advanced closed-source counterparts, achieving a
score above 32 on the challenging HealthBench Hard benchmark, a level previously
exceeded only by GPT-5. Our work demonstrates that a robust dynamic verifier
system is essential for aligning LLM capabilities with practical clinical
applications, establishing a new Pareto front in the performance-parameter
trade-off for medical AI deployment.