MediX-R1: Open Ended Medical Reinforcement Learning

Abstract

We introduce MediX-R1, an open-ended Reinforcement Learning (RL) framework for medical multimodal large language models (MLLMs) that enables clinically grounded, free-form answers beyond multiple-choice formats. MediX-R1 fine-tunes a baseline vision-language backbone with Group Based RL and a composite reward tailored for medical reasoning: an LLM-based accuracy reward that judges semantic correctness with a strict YES/NO decision, a medical embedding-based semantic reward that captures paraphrases and terminology variants, and lightweight format and modality rewards that enforce interpretable reasoning and modality recognition. This multi-signal design provides stable, informative feedback for open-ended outputs, where traditional verifiable or MCQ-only rewards fall short. To measure progress, we propose a unified evaluation framework for both text-only and image+text tasks that uses a reference-based LLM-as-judge in place of brittle string-overlap metrics, capturing semantic correctness, reasoning, and contextual alignment. Despite using only ~51K instruction examples, MediX-R1 achieves excellent results across standard medical LLM (text-only) and VLM (image + text) benchmarks, outperforming strong open-source baselines and delivering particularly large gains on open-ended clinical tasks. Our results demonstrate that open-ended RL with comprehensive reward signals and LLM-based evaluation is a practical path toward reliable medical reasoning in multimodal models. Our trained models, curated datasets, and source code are available at https://medix.cvmbzuai.com.
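The composite reward described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the `<modality>`/`<think>`/`<answer>` tag layout, the weights, the `judge` and `embed` callables (stand-ins for the LLM judge and the medical embedding model), and the mean/std group normalization used to turn rewards into advantages.

```python
import math
import re
from typing import Callable, List, Sequence

# Hypothetical output convention; the abstract does not specify the tags.
PATTERN = re.compile(
    r"^<modality>(.+?)</modality>\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>$",
    re.DOTALL,
)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def composite_reward(
    response: str,
    reference: str,
    modality: str,
    judge: Callable[[str, str], str],        # stand-in LLM judge, returns "YES" or "NO"
    embed: Callable[[str], Sequence[float]],  # stand-in medical embedding model
    weights: Sequence[float] = (1.0, 0.5, 0.25, 0.25),  # assumed, not from the paper
) -> float:
    match = PATTERN.match(response.strip())
    if match is None:
        # In this sketch an unparseable response earns no reward at all.
        return 0.0
    pred_modality, _reasoning, answer = (g.strip() for g in match.groups())

    r_acc = 1.0 if judge(answer, reference).strip().upper() == "YES" else 0.0
    r_sem = max(0.0, cosine(embed(answer), embed(reference)))  # clip negatives
    r_fmt = 1.0  # the response matched the expected layout
    r_mod = 1.0 if pred_modality.lower() == modality.lower() else 0.0

    w_acc, w_sem, w_fmt, w_mod = weights
    return w_acc * r_acc + w_sem * r_sem + w_fmt * r_fmt + w_mod * r_mod


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of sampled responses (assumed scheme)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]
```

In a setup like this, several responses would be sampled for the same prompt, each scored with `composite_reward`, and the scores passed to `group_advantages` so that each response is credited relative to its siblings. The binary judge signal dominates, while the embedding term gives partial credit to paraphrases the judge might reject.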
