Eliciting Medical Reasoning with Knowledge-enhanced Data Synthesis: A Semi-Supervised Reinforcement Learning Approach

Abstract

While large language models hold promise for complex medical applications, their development is hindered by the scarcity of high-quality reasoning data. To address this issue, existing approaches typically distill chain-of-thought reasoning traces from large proprietary models via supervised fine-tuning and then apply reinforcement learning (RL). These methods yield limited improvement in underrepresented domains such as rare diseases, and generating complex reasoning chains is costly. To enhance medical reasoning efficiently, we propose MedSSR, a Medical Knowledge-enhanced data Synthesis and Semi-supervised Reinforcement learning framework. Our framework first uses rare-disease knowledge to synthesize reasoning questions with a controllable distribution. We then use the policy model itself to generate high-quality pseudo-labels for these questions. This enables a two-stage, intrinsic-to-extrinsic training paradigm: self-supervised RL on the pseudo-labeled synthetic data, followed by supervised RL on human-annotated real data. MedSSR scales model training efficiently without relying on costly trace distillation. Extensive experiments on Qwen and Llama show that our method outperforms existing methods across ten medical benchmarks, with gains of up to +5.93% on rare-disease tasks. Our code is available at https://github.com/tdlhl/MedSSR.
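As a rough illustration of the pseudo-labeling step, the sketch below assigns a label to a synthetic question by majority vote over several answers sampled from the policy model, and scores a prediction with an exact-match reward. The abstract does not specify how pseudo-labels are produced or how rewards are computed, so majority voting, the agreement threshold, and every name here (`pseudo_label`, `reward`, `build_stage_one_data`, `sample_answer`) are assumptions made for illustration and may differ from the released code.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple


def pseudo_label(
    question: str,
    sample_answer: Callable[[str], str],
    n_samples: int = 8,
    min_agreement: float = 0.5,
) -> Optional[Tuple[str, float]]:
    """Return (label, agreement) by majority vote, or None if agreement is too low.

    `sample_answer` is a hypothetical stand-in for one stochastic decode of the
    policy model that returns a final answer string (e.g. an option letter).
    """
    answers = [sample_answer(question) for _ in range(n_samples)]
    label, votes = Counter(answers).most_common(1)[0]
    agreement = votes / n_samples
    if agreement < min_agreement:
        # Assumption: low-agreement questions are dropped as unreliable.
        return None
    return label, agreement


def reward(predicted: str, reference: str) -> float:
    """Exact-match reward, ignoring case and surrounding whitespace.

    In stage one the reference would be a pseudo-label; in stage two it would
    be a human-annotated answer.
    """
    return 1.0 if predicted.strip().upper() == reference.strip().upper() else 0.0


def build_stage_one_data(
    questions: List[str],
    sample_answer: Callable[[str], str],
    n_samples: int = 8,
    min_agreement: float = 0.5,
) -> List[Tuple[str, str]]:
    """Keep only the synthetic questions that received a confident pseudo-label."""
    labeled = []
    for question in questions:
        result = pseudo_label(question, sample_answer, n_samples, min_agreement)
        if result is not None:
            labeled.append((question, result[0]))
    return labeled
```

Under these assumptions, the same `reward` function could serve both stages, with only the source of the reference answer changing from pseudo-labels to human annotations.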