
Healthcare AI GYM for Medical Agents

May 1, 2026
Author: Minbyul Jeong
cs.AI

Abstract

Clinical reasoning demands multi-step interactions -- gathering patient history, ordering tests, interpreting results, and making safe treatment decisions -- yet a unified training environment that provides the breadth of clinical domains and specialized tools needed to train generalizable medical AI agents through reinforcement learning remains elusive. We present a comprehensive empirical study of multi-turn agentic RL for medical AI, built on a gymnasium-compatible environment spanning 10 clinical domains with 3.6K+ tasks, 135 domain-specific tools, and a knowledge base of 828K medical passages. Our analysis reveals that the agentic multi-turn structure degrades into verbose single-turn monologues, characterized by monotonic length explosion and a simultaneous erosion of tool-use frequency. We show how this collapse, alongside distillation instability, stems from the misalignment of sparse terminal rewards with sequential clinical trajectories. We find that vanilla GRPO achieves strong final accuracy on some benchmarks but suffers from training instability, evidenced by significant oscillations in response length and prolonged convergence periods. To improve training efficiency and stability, we propose Turn-level Truncated On-Policy Distillation (TT-OPD), a self-distillation framework in which a gradient-free EMA teacher leverages outcome-privileged information to provide dense, outcome-aware KL regularization at every conversation turn. TT-OPD achieves the best performance on 10 of 18 benchmarks, with an average +3.9 pp improvement over the non-RL baseline, faster early convergence, controlled response length, and sustained multi-turn tool use.
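The two mechanisms the abstract names for TT-OPD -- a gradient-free EMA teacher and a dense per-turn KL penalty -- can be sketched as follows. This is a minimal illustration in PyTorch, not the paper's implementation; the function names (`ema_update`, `turn_kl_loss`), the decay value, and the masking scheme are all assumptions for the sake of the example.

```python
import torch
import torch.nn.functional as F

def ema_update(teacher, student, decay=0.99):
    """Gradient-free teacher: its weights track an exponential moving
    average of the student's weights and receive no gradients."""
    with torch.no_grad():
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.mul_(decay).add_(s_p, alpha=1 - decay)

def turn_kl_loss(student_logits, teacher_logits, turn_mask):
    """Dense KL(student || teacher) over token positions, applied at every
    conversation turn rather than only at the terminal reward.

    student_logits, teacher_logits: [batch, seq, vocab]
    turn_mask: [batch, seq], 1.0 on tokens belonging to agent turns.
    """
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    kl = (log_p_s.exp() * (log_p_s - log_p_t)).sum(-1)   # [batch, seq]
    return (kl * turn_mask).sum() / turn_mask.sum().clamp(min=1)
```

In a training loop, `ema_update` would run after each optimizer step, and `turn_kl_loss` would be added (with some weight) to the sparse outcome reward objective, giving the student a per-turn learning signal.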
PDF · May 7, 2026