

MediX-R1: Open Ended Medical Reinforcement Learning

February 26, 2026
Authors: Sahal Shaji Mullappilly, Mohammed Irfan Kurpath, Omair Mohamed, Mohamed Zidan, Fahad Khan, Salman Khan, Rao Anwer, Hisham Cholakkal
cs.AI

Abstract

We introduce MediX-R1, an open-ended Reinforcement Learning (RL) framework for medical multimodal large language models (MLLMs) that enables clinically grounded, free-form answers beyond multiple-choice formats. MediX-R1 fine-tunes a baseline vision-language backbone with Group Based RL and a composite reward tailored for medical reasoning: an LLM-based accuracy reward that judges semantic correctness with a strict YES/NO decision, a medical embedding-based semantic reward to capture paraphrases and terminology variants, and lightweight format and modality rewards that enforce interpretable reasoning and modality recognition. This multi-signal design provides stable, informative feedback for open-ended outputs where traditional verifiable or MCQ-only rewards fall short. To measure progress, we propose a unified evaluation framework for both text-only and image+text tasks that uses a Reference-based LLM-as-judge in place of brittle string-overlap metrics, capturing semantic correctness, reasoning, and contextual alignment. Despite using only ~51K instruction examples, MediX-R1 achieves excellent results across standard medical LLM (text-only) and VLM (image+text) benchmarks, outperforming strong open-source baselines and delivering particularly large gains on open-ended clinical tasks. Our results demonstrate that open-ended RL with comprehensive reward signals and LLM-based evaluation is a practical path toward reliable medical reasoning in multimodal models. Our trained models, curated datasets, and source code are available at https://medix.cvmbzuai.com
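As a concrete illustration of the composite reward described in the abstract, the sketch below shows one way the four signals could be combined into a single scalar for group-based RL. It is a minimal sketch under stated assumptions: the weights, the `<think>/<answer>` tag convention, and the `judge` and `embed` callables are illustrative placeholders, not MediX-R1's released implementation (see the source code at the project URL for the actual design).

```python
# Illustrative composite-reward sketch in the spirit of MediX-R1's description.
# Weights, tag format, and the judge/embed callables are assumptions for this example.
import re
from typing import Callable, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Plain cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0


def composite_reward(
    response: str,
    reference: str,
    true_modality: str,
    judge: Callable[[str, str], str],          # LLM judge: returns "YES"/"NO" for (response, reference)
    embed: Callable[[str], Sequence[float]],   # medical-domain text embedding (assumed interface)
    weights=(0.5, 0.3, 0.1, 0.1),              # illustrative weighting of the four signals
) -> float:
    # 1) LLM-based accuracy reward: strict binary semantic-correctness judgment.
    accuracy = 1.0 if judge(response, reference).strip().upper().startswith("YES") else 0.0

    # 2) Embedding-based semantic reward: tolerant to paraphrases and terminology variants.
    semantic = max(0.0, cosine(embed(response), embed(reference)))

    # 3) Format reward: encourage explicit reasoning followed by a final answer
    #    (the <think>/<answer> tags are an assumed convention, not the paper's spec).
    fmt = 1.0 if re.search(r"<think>.+</think>\s*<answer>.+</answer>", response, re.S) else 0.0

    # 4) Modality reward: check whether the stated imaging modality appears in the response.
    modality = 1.0 if true_modality.lower() in response.lower() else 0.0

    w_acc, w_sem, w_fmt, w_mod = weights
    return w_acc * accuracy + w_sem * semantic + w_fmt * fmt + w_mod * modality
```

Passing the judge and embedding model in as callables keeps the sketch agnostic to whichever LLM judge or medical embedding model is actually used; in practice these would wrap the paper's chosen models.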