
MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale

June 4, 2025
Authors: Ran Xu, Yuchen Zhuang, Yishan Zhong, Yue Yu, Xiangru Tang, Hang Wu, May D. Wang, Peifeng Ruan, Donghan Yang, Tao Wang, Guanghua Xiao, Carl Yang, Yang Xie, Wenqi Shi
cs.AI

Abstract

We introduce MedAgentGym, the first publicly available training environment designed to enhance coding-based medical reasoning capabilities in large language model (LLM) agents. MedAgentGym comprises 72,413 task instances across 129 categories derived from real-world biomedical scenarios. Tasks are encapsulated within executable coding environments, each featuring detailed task descriptions, interactive feedback mechanisms, verifiable ground-truth annotations, and scalable training trajectory generation. Extensive benchmarking of over 30 LLMs reveals a notable performance disparity between commercial API-based models and their open-source counterparts. Leveraging MedAgentGym, Med-Copilot-7B achieves substantial performance gains through supervised fine-tuning (+36.44%) and continued reinforcement learning (+42.47%), emerging as an affordable and privacy-preserving alternative that is competitive with gpt-4o. By offering both a comprehensive benchmark and accessible, expandable training resources within unified execution environments, MedAgentGym provides an integrated platform for developing LLM-based coding assistants for advanced biomedical research and practice.