MediX-R1: Open Ended Medical Reinforcement Learning
February 26, 2026
Authors: Sahal Shaji Mullappilly, Mohammed Irfan Kurpath, Omair Mohamed, Mohamed Zidan, Fahad Khan, Salman Khan, Rao Anwer, Hisham Cholakkal
cs.AI
Abstract
We introduce MediX-R1, an open-ended Reinforcement Learning (RL) framework for medical multimodal large language models (MLLMs) that enables clinically grounded, free-form answers beyond multiple-choice formats. MediX-R1 fine-tunes a baseline vision-language backbone with Group-Based RL and a composite reward tailored for medical reasoning: an LLM-based accuracy reward that judges semantic correctness with a strict YES/NO decision, a medical embedding-based semantic reward that captures paraphrases and terminology variants, and lightweight format and modality rewards that enforce interpretable reasoning and modality recognition. This multi-signal design provides stable, informative feedback for open-ended outputs where traditional verifiable or MCQ-only rewards fall short. To measure progress, we propose a unified evaluation framework for both text-only and image+text tasks that uses a reference-based LLM-as-judge in place of brittle string-overlap metrics, capturing semantic correctness, reasoning, and contextual alignment. Despite using only ~51K instruction examples, MediX-R1 achieves excellent results across standard medical LLM (text-only) and VLM (image+text) benchmarks, outperforming strong open-source baselines and delivering particularly large gains on open-ended clinical tasks. Our results demonstrate that open-ended RL with comprehensive reward signals and LLM-based evaluation is a practical path toward reliable medical reasoning in multimodal models. Our trained models, curated datasets, and source code are available at https://medix.cvmbzuai.com.
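The composite reward described above combines four signals: a strict YES/NO accuracy verdict from an LLM judge, an embedding-based semantic score, and lightweight format and modality checks. A minimal sketch of how such signals might be aggregated is shown below; the function names, the `<think>` tag convention, and the weights are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a MediX-R1-style composite reward.
# All names, tags, and weights here are assumptions for illustration.

def accuracy_reward(judge_verdict: str) -> float:
    # The LLM judge emits a strict YES/NO on semantic correctness.
    return 1.0 if judge_verdict.strip().upper() == "YES" else 0.0

def semantic_reward(cos_sim: float) -> float:
    # Cosine similarity between medical embeddings of the model answer
    # and the reference, clipped to [0, 1]; rewards paraphrases.
    return max(0.0, min(1.0, cos_sim))

def format_reward(answer: str) -> float:
    # Lightweight check that the answer exposes its reasoning,
    # e.g. inside a <think>...</think> block (assumed tag convention).
    return 1.0 if "<think>" in answer and "</think>" in answer else 0.0

def modality_reward(predicted: str, actual: str) -> float:
    # Reward for correctly naming the imaging modality (e.g. "CT").
    return 1.0 if predicted.strip().lower() == actual.strip().lower() else 0.0

def composite_reward(judge_verdict: str, cos_sim: float, answer: str,
                     pred_modality: str, true_modality: str,
                     weights=(0.5, 0.3, 0.1, 0.1)) -> float:
    # Weighted sum of the four signals; the weights are placeholders.
    w_acc, w_sem, w_fmt, w_mod = weights
    return (w_acc * accuracy_reward(judge_verdict)
            + w_sem * semantic_reward(cos_sim)
            + w_fmt * format_reward(answer)
            + w_mod * modality_reward(pred_modality, true_modality))
```

In a group-based RL setup, a scalar reward of this form would be computed per sampled completion and used to rank completions within each group; the dense semantic and format terms keep the feedback informative even when the strict accuracy verdict is NO.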