MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale
June 4, 2025
Authors: Ran Xu, Yuchen Zhuang, Yishan Zhong, Yue Yu, Xiangru Tang, Hang Wu, May D. Wang, Peifeng Ruan, Donghan Yang, Tao Wang, Guanghua Xiao, Carl Yang, Yang Xie, Wenqi Shi
cs.AI
Abstract
We introduce MedAgentGYM, the first publicly available training environment
designed to enhance coding-based medical reasoning capabilities in large
language model (LLM) agents. MedAgentGYM comprises 72,413 task instances across
129 categories derived from real-world biomedical scenarios. Tasks
are encapsulated within executable coding environments, each featuring detailed
task descriptions, interactive feedback mechanisms, verifiable ground-truth
annotations, and scalable training trajectory generation. Extensive
benchmarking of over 30 LLMs reveals a notable performance disparity between
commercial API-based models and open-source counterparts. Leveraging
MedAgentGYM, Med-Copilot-7B achieves substantial performance gains through
supervised fine-tuning (+36.44%) and continued reinforcement learning
(+42.47%), emerging as an affordable and privacy-preserving alternative
competitive with gpt-4o. By offering both a comprehensive benchmark and
accessible, expandable training resources within unified execution
environments, MedAgentGYM delivers an integrated platform to develop LLM-based
coding assistants for advanced biomedical research and practice.
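
To make the environment structure described above concrete, the following is a minimal sketch of one executable task with a task description, interactive feedback, a verifiable ground-truth answer, and trajectory collection. It is not the MedAgentGYM API: the names `CodeTaskEnv`, `step`, `rollout`, the `answer` variable convention, and the BMI task are all hypothetical, and `exec` is used here without any sandboxing purely for illustration.

```python
from dataclasses import dataclass
import traceback


@dataclass
class CodeTaskEnv:
    """Hypothetical single-task environment (illustrative, not the real API)."""
    description: str      # natural-language task description
    ground_truth: object  # verifiable reference answer
    max_turns: int = 3    # interaction budget per episode

    def step(self, code: str):
        """Run submitted code and return (feedback, reward, solved)."""
        namespace = {}
        try:
            exec(code, namespace)  # assumption: no sandbox, demo only
        except Exception:
            # Return only the final traceback line as interactive feedback.
            last = traceback.format_exc().strip().splitlines()[-1]
            return f"error: {last}", 0.0, False
        if "answer" not in namespace:
            return "error: no variable named 'answer' was set", 0.0, False
        if namespace["answer"] == self.ground_truth:
            return "correct", 1.0, True
        return f"incorrect: got {namespace['answer']!r}", 0.0, False


def rollout(env, policy):
    """Collect one trajectory of (code, feedback, reward) turns."""
    trajectory, feedback = [], env.description
    for _ in range(env.max_turns):
        code = policy(feedback)  # the agent sees the latest feedback
        feedback, reward, solved = env.step(code)
        trajectory.append((code, feedback, reward))
        if solved:
            break
    return trajectory


# Hypothetical task and a scripted "agent" that fails once, then succeeds.
env = CodeTaskEnv(
    description="Compute BMI for 70 kg and 1.75 m, rounded to 1 decimal; "
                "store the result in `answer`.",
    ground_truth=22.9,
)
attempts = iter(["answer = 70 / 0", "answer = round(70 / 1.75 ** 2, 1)"])
trajectory = rollout(env, lambda feedback: next(attempts))
for code, feedback, reward in trajectory:
    print(reward, feedback)
```

In a sketch like this, trajectories ending in reward 1.0 could serve as supervised fine-tuning data, while the per-turn reward could drive reinforcement learning; how MedAgentGYM actually samples and filters trajectories is not specified in the abstract.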