Healthcare AI GYM for Medical Agents

Abstract

Clinical reasoning demands multi-step interactions (gathering patient history, ordering tests, interpreting results, and making safe treatment decisions), yet a unified training environment that provides the breadth of clinical domains and specialized tools needed to train generalizable medical AI agents through reinforcement learning remains elusive. We present a comprehensive empirical study of multi-turn agentic RL for medical AI, built on a gymnasium-compatible environment spanning 10 clinical domains with 3.6K+ tasks, 135 domain-specific tools, and a knowledge base of 828K medical passages. Our analysis reveals that agentic multi-turn structure degrades into verbose single-turn monologues, characterized by monotonic length explosion and a simultaneous erosion of tool-use frequency. We characterize how this collapse, alongside distillation instability, stems from the misalignment of sparse terminal rewards with sequential clinical trajectories. We find that vanilla GRPO achieves strong final accuracy on some benchmarks but suffers from training instability, evidenced by large oscillations in response length and prolonged convergence periods. To improve training efficiency and stability, we propose Turn-level Truncated On-Policy Distillation (TT-OPD), a self-distillation framework in which a gradient-free EMA teacher leverages outcome-privileged information to provide dense, outcome-aware KL regularization at every conversation turn. TT-OPD achieves the best performance on 10 of 18 benchmarks, with an average improvement of +3.9 percentage points over the non-RL baseline, along with faster early convergence, controlled response length, and sustained multi-turn tool use.
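The two mechanisms named in the abstract, a gradient-free EMA teacher and a KL term computed at every conversation turn, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the decay value, the KL direction (student relative to teacher), and the reading of "truncated" as "only the first few tokens of each turn contribute" are all assumptions made for illustration, and the teacher's outcome-privileged conditioning is represented only by the teacher logits passed in.

```python
import math


def ema_update(teacher, student, decay=0.99):
    """Gradient-free teacher update: teacher <- decay * teacher + (1 - decay) * student.

    `decay` is a hypothetical value; the abstract does not state one.
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def turn_level_kl(student_turns, teacher_turns, max_tokens_per_turn=4):
    """Mean token-level KL(student || teacher) for each conversation turn.

    Each turn is a list of per-token logit vectors. Only the first
    `max_tokens_per_turn` tokens of a turn are used, which is an ASSUMED
    interpretation of "truncated" and is not confirmed by the abstract.
    """
    per_turn = []
    for s_turn, t_turn in zip(student_turns, teacher_turns):
        pairs = list(zip(s_turn, t_turn))[:max_tokens_per_turn]
        kls = [kl_divergence(softmax(s), softmax(t)) for s, t in pairs]
        per_turn.append(sum(kls) / len(kls) if kls else 0.0)
    return per_turn
```

In a training loop, the per-turn values would typically be weighted and added to the RL objective as a regularizer, so that every turn receives a learning signal even when the task reward arrives only at the end of the trajectory.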
