MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale

Abstract

We introduce MedAgentGYM, the first publicly available training environment designed to enhance coding-based medical reasoning capabilities in large language model (LLM) agents. MedAgentGYM comprises 72,413 task instances across 129 categories derived from real-world biomedical scenarios. Tasks are encapsulated within executable coding environments, each featuring detailed task descriptions, interactive feedback mechanisms, verifiable ground-truth annotations, and scalable training trajectory generation. Extensive benchmarking of over 30 LLMs reveals a notable performance disparity between commercial API-based models and their open-source counterparts. Leveraging MedAgentGYM, Med-Copilot-7B achieves substantial performance gains through supervised fine-tuning (+36.44%) and continued reinforcement learning (+42.47%), emerging as an affordable and privacy-preserving alternative competitive with gpt-4o. By offering both a comprehensive benchmark and accessible, expandable training resources within unified execution environments, MedAgentGYM delivers an integrated platform to develop LLM-based coding assistants for advanced biomedical research and practice.
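To make the idea of an executable coding environment concrete, the following is a minimal sketch of how a task description, interactive feedback, a verifiable ground truth, and a recorded trajectory might fit together. It is not the MedAgentGYM implementation or API: the class `ToyCodingTask`, its `step` method, the convention that submitted code stores its result in a variable named `answer`, and the body-mass-index task are all assumptions made for illustration.

```python
import contextlib
import io
from dataclasses import dataclass, field


@dataclass
class ToyCodingTask:
    """Hypothetical task: a description plus a verifiable ground-truth value."""

    description: str
    ground_truth: float
    tolerance: float = 1e-6
    # Recorded (code, feedback, solved) triples, i.e. one interaction trajectory.
    trajectory: list = field(default_factory=list)

    def step(self, code: str) -> tuple[str, bool]:
        """Execute submitted code and return (feedback, solved)."""
        namespace: dict = {}
        captured = io.StringIO()
        try:
            with contextlib.redirect_stdout(captured):
                # Assumption: no sandboxing here; a real environment would isolate execution.
                exec(code, namespace)
        except Exception as exc:
            # Runtime errors are returned to the agent as feedback.
            feedback, solved = f"error: {type(exc).__name__}: {exc}", False
        else:
            answer = namespace.get("answer")
            if not isinstance(answer, (int, float)):
                feedback, solved = "set a numeric variable named 'answer'", False
            elif abs(answer - self.ground_truth) <= self.tolerance:
                feedback, solved = "correct", True
            else:
                feedback, solved = "incorrect: answer does not match ground truth", False
        self.trajectory.append((code, feedback, solved))
        return feedback, solved


# Illustrative task (made up for this sketch, not taken from the dataset).
task = ToyCodingTask(
    description="Compute BMI for weight 70 kg and height 1.75 m; store it in `answer`.",
    ground_truth=70 / 1.75 ** 2,
)

# A first attempt with a wrong formula, followed by a corrected attempt.
first_feedback, first_solved = task.step("answer = 70 / 1.75")
second_feedback, second_solved = task.step("answer = 70 / 1.75 ** 2")
print(first_feedback)
print(second_feedback)
```

In this sketch the agent would read `description`, submit code, and revise it based on the returned feedback until the check against `ground_truth` passes; the accumulated `trajectory` is the kind of record that could later serve as training data.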