MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale
June 4, 2025
Authors: Ran Xu, Yuchen Zhuang, Yishan Zhong, Yue Yu, Xiangru Tang, Hang Wu, May D. Wang, Peifeng Ruan, Donghan Yang, Tao Wang, Guanghua Xiao, Carl Yang, Yang Xie, Wenqi Shi
cs.AI
Abstract
We introduce MedAgentGYM, the first publicly available training environment
designed to enhance coding-based medical reasoning capabilities in large
language model (LLM) agents. MedAgentGYM comprises 72,413 task instances across
129 categories derived from authentic real-world biomedical scenarios. Tasks
are encapsulated within executable coding environments, each featuring detailed
task descriptions, interactive feedback mechanisms, verifiable ground-truth
annotations, and scalable training trajectory generation. Extensive
benchmarking of over 30 LLMs reveals a notable performance disparity between
commercial API-based models and open-source counterparts. Leveraging
MedAgentGYM, Med-Copilot-7B achieves substantial performance gains through
supervised fine-tuning (+36.44%) and continued reinforcement learning
(+42.47%), emerging as an affordable and privacy-preserving alternative
competitive with gpt-4o. By offering both a comprehensive benchmark and
accessible, expandable training resources within unified execution
environments, MedAgentGYM delivers an integrated platform to develop LLM-based
coding assistants for advanced biomedical research and practice.
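The abstract describes tasks wrapped in executable coding environments with detailed task descriptions, interactive execution feedback, verifiable ground-truth annotations, and scalable trajectory generation for later fine-tuning. The sketch below illustrates how such an agent-environment interaction loop might look. It is a minimal sketch assuming a Gym-style interface; every name in it (MedCodeEnv, StepResult, query_llm, rollout) is a hypothetical placeholder, not the actual MedAgentGYM API.

```python
# Hypothetical agent-environment loop in the spirit of the abstract:
# the agent reads a task description, proposes code, runs it in a sandbox,
# and folds the feedback into the next attempt until the output matches
# the verifiable ground-truth annotation. Trajectories collected this way
# could serve as data for supervised fine-tuning or reinforcement learning.
# All names below are illustrative assumptions, not the real MedAgentGym API.

from dataclasses import dataclass


@dataclass
class StepResult:
    feedback: str   # execution output or error trace returned by the sandbox
    solved: bool    # whether the output matches the ground-truth annotation


class MedCodeEnv:
    """Toy stand-in for an executable coding environment."""

    def __init__(self, task_description: str, ground_truth: str):
        self.task_description = task_description
        self.ground_truth = ground_truth

    def reset(self) -> str:
        # Return the initial task description shown to the agent.
        return self.task_description

    def step(self, code: str) -> StepResult:
        # A real environment would execute `code` in a sandbox; here we only
        # compare a dummy "output" string against the annotation.
        output = code.strip()
        return StepResult(
            feedback=f"stdout: {output}",
            solved=(output == self.ground_truth),
        )


def query_llm(prompt: str) -> str:
    """Placeholder for an LLM call that returns a candidate code solution."""
    return "print(42)"


def rollout(env: MedCodeEnv, max_turns: int = 5) -> list[dict]:
    """Collect one interaction trajectory (usable later for SFT / RL)."""
    prompt = env.reset()
    trajectory = []
    for _ in range(max_turns):
        code = query_llm(prompt)
        result = env.step(code)
        trajectory.append(
            {"prompt": prompt, "code": code, "feedback": result.feedback}
        )
        if result.solved:
            break
        # Fold the execution feedback back into the next prompt.
        prompt = (
            f"{prompt}\n\nPrevious attempt:\n{code}\n"
            f"Feedback:\n{result.feedback}"
        )
    return trajectory
```

Under this reading, the "scalable training trajectory generation" mentioned in the abstract corresponds to running many such rollouts in parallel and keeping the interaction records as fine-tuning data; the details of how MedAgentGym actually exposes its environments are given in the paper, not in this abstract.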