MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale
June 4, 2025
Authors: Ran Xu, Yuchen Zhuang, Yishan Zhong, Yue Yu, Xiangru Tang, Hang Wu, May D. Wang, Peifeng Ruan, Donghan Yang, Tao Wang, Guanghua Xiao, Carl Yang, Yang Xie, Wenqi Shi
cs.AI
Abstract
We introduce MedAgentGYM, the first publicly available training environment
designed to enhance coding-based medical reasoning capabilities in large
language model (LLM) agents. MedAgentGYM comprises 72,413 task instances across
129 categories derived from authentic real-world biomedical scenarios. Tasks
are encapsulated within executable coding environments, each featuring detailed
task descriptions, interactive feedback mechanisms, verifiable ground-truth
annotations, and scalable training trajectory generation. Extensive
benchmarking of over 30 LLMs reveals a notable performance disparity between
commercial API-based models and open-source counterparts. Leveraging
MedAgentGYM, Med-Copilot-7B achieves substantial performance gains through
supervised fine-tuning (+36.44%) and continued reinforcement learning
(+42.47%), emerging as an affordable and privacy-preserving alternative
competitive with gpt-4o. By offering both a comprehensive benchmark and
accessible, expandable training resources within unified execution
environments, MedAgentGYM delivers an integrated platform to develop LLM-based
coding assistants for advanced biomedical research and practice.
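The abstract describes tasks wrapped in executable coding environments with detailed task descriptions, interactive execution feedback, verifiable ground-truth annotations, and scalable trajectory generation for later fine-tuning. The sketch below illustrates how such an agent-environment interaction loop might look. It is a minimal sketch assuming a Gym-style interface; every name in it (MedCodeEnv, StepResult, query_llm, rollout) is a hypothetical placeholder, not the actual MedAgentGYM API.

```python
# Hypothetical agent-environment loop in the spirit of the abstract:
# the agent reads a task description, proposes code, runs it in a sandbox,
# and folds the feedback into the next attempt until the output matches
# the verifiable ground-truth annotation. Trajectories collected this way
# could serve as data for supervised fine-tuning or reinforcement learning.
# All names below are illustrative assumptions, not the real MedAgentGym API.

from dataclasses import dataclass


@dataclass
class StepResult:
    feedback: str   # execution output or error trace returned by the sandbox
    solved: bool    # whether the output matches the ground-truth annotation


class MedCodeEnv:
    """Toy stand-in for an executable coding environment."""

    def __init__(self, task_description: str, ground_truth: str):
        self.task_description = task_description
        self.ground_truth = ground_truth

    def reset(self) -> str:
        # Return the initial task description shown to the agent.
        return self.task_description

    def step(self, code: str) -> StepResult:
        # A real environment would execute `code` in a sandbox; here we only
        # compare a dummy "output" string against the annotation.
        output = code.strip()
        return StepResult(
            feedback=f"stdout: {output}",
            solved=(output == self.ground_truth),
        )


def query_llm(prompt: str) -> str:
    """Placeholder for an LLM call that returns a candidate code solution."""
    return "print(42)"


def rollout(env: MedCodeEnv, max_turns: int = 5) -> list[dict]:
    """Collect one interaction trajectory (usable later for SFT / RL)."""
    prompt = env.reset()
    trajectory = []
    for _ in range(max_turns):
        code = query_llm(prompt)
        result = env.step(code)
        trajectory.append(
            {"prompt": prompt, "code": code, "feedback": result.feedback}
        )
        if result.solved:
            break
        # Fold the execution feedback back into the next prompt.
        prompt = (
            f"{prompt}\n\nPrevious attempt:\n{code}\n"
            f"Feedback:\n{result.feedback}"
        )
    return trajectory
```

Under this reading, the "scalable training trajectory generation" mentioned in the abstract corresponds to running many such rollouts in parallel and keeping the interaction records as fine-tuning data; the details of how MedAgentGym actually exposes its environments are given in the paper, not in this abstract.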