MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale

Abstract

We introduce MedAgentGym, the first publicly available training environment designed to enhance coding-based medical reasoning in large language model (LLM) agents. MedAgentGym comprises 72,413 task instances across 129 categories derived from real-world biomedical scenarios. Each task is encapsulated in an executable coding environment that provides a detailed task description, an interactive feedback mechanism, verifiable ground-truth annotations, and scalable generation of training trajectories. Extensive benchmarking of more than 30 LLMs reveals a notable performance gap between commercial API-based models and their open-source counterparts. Using MedAgentGym, Med-Copilot-7B achieves substantial performance gains through supervised fine-tuning (+36.44%) and continued reinforcement learning (+42.47%), emerging as an affordable, privacy-preserving alternative that is competitive with gpt-4o. By offering both a comprehensive benchmark and accessible, expandable training resources within unified execution environments, MedAgentGym provides an integrated platform for developing LLM-based coding assistants for advanced biomedical research and practice.
