MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale

June 4, 2025
Authors: Ran Xu, Yuchen Zhuang, Yishan Zhong, Yue Yu, Xiangru Tang, Hang Wu, May D. Wang, Peifeng Ruan, Donghan Yang, Tao Wang, Guanghua Xiao, Carl Yang, Yang Xie, Wenqi Shi
cs.AI

Abstract

We introduce MedAgentGYM, the first publicly available training environment designed to enhance coding-based medical reasoning capabilities in large language model (LLM) agents. MedAgentGYM comprises 72,413 task instances across 129 categories derived from authentic real-world biomedical scenarios. Tasks are encapsulated within executable coding environments, each featuring detailed task descriptions, interactive feedback mechanisms, verifiable ground-truth annotations, and scalable training trajectory generation. Extensive benchmarking of over 30 LLMs reveals a notable performance disparity between commercial API-based models and open-source counterparts. Leveraging MedAgentGYM, Med-Copilot-7B achieves substantial performance gains through supervised fine-tuning (+36.44%) and continued reinforcement learning (+42.47%), emerging as an affordable and privacy-preserving alternative competitive with gpt-4o. By offering both a comprehensive benchmark and accessible, expandable training resources within unified execution environments, MedAgentGYM delivers an integrated platform to develop LLM-based coding assistants for advanced biomedical research and practice.
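
To make the idea of an executable coding environment with interactive feedback and verifiable ground truth more concrete, here is a minimal Python sketch. It is not the MedAgentGYM API: the names `CodeTask` and `CodeEnv`, the `answer` variable convention, and the BMI task are all hypothetical and chosen only for illustration. It also runs code with a bare `exec`, which a real system would presumably replace with an isolated sandbox.

```python
import traceback
from dataclasses import dataclass, field


@dataclass
class CodeTask:
    """Hypothetical task record: a description plus a verifiable ground truth."""
    description: str
    ground_truth: object


@dataclass
class CodeEnv:
    """Toy executable environment (illustrative only, not the MedAgentGYM API)."""
    task: CodeTask
    history: list = field(default_factory=list)  # (code, feedback) pairs

    def step(self, code: str) -> tuple:
        """Run submitted code and return (feedback, solved)."""
        namespace: dict = {}
        try:
            # Assumption: unsandboxed exec is acceptable for this toy example only.
            exec(code, namespace)
        except Exception:
            last_line = traceback.format_exc(limit=1).strip().splitlines()[-1]
            feedback, solved = "error: " + last_line, False
        else:
            if "answer" not in namespace:
                feedback, solved = "error: no variable named 'answer' was set", False
            elif namespace["answer"] == self.task.ground_truth:
                feedback, solved = "correct", True
            else:
                feedback, solved = f"incorrect: got {namespace['answer']!r}", False
        # The recorded attempts form a simple trajectory of code and feedback.
        self.history.append((code, feedback))
        return feedback, solved


task = CodeTask(
    description="Compute BMI for weight 70 kg and height 1.75 m, rounded to 1 decimal.",
    ground_truth=22.9,
)
env = CodeEnv(task)

# A scripted stand-in for an LLM agent: first attempt is wrong, second is fixed.
attempts = [
    "answer = round(70 / 1.75, 1)",       # forgets to square the height
    "answer = round(70 / 1.75 ** 2, 1)",  # corrected formula
]
for code in attempts:
    feedback, solved = env.step(code)
    print(feedback)
    if solved:
        break
```

In this sketch the agent is a fixed list of attempts; in practice the feedback string would be passed back to a model to produce the next attempt, and `env.history` would be the kind of record that could be kept as a training trajectory.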