When Does RL Help Medical VLMs? Disentangling Vision, SFT, and RL Gains
March 1, 2026
Authors: Ahmadreza Jeddi, Kimia Shaban, Negin Baghbanzadeh, Natasha Sharan, Abhishek Moturu, Elham Dolatabadi, Babak Taati
cs.AI
Abstract
Reinforcement learning (RL) is increasingly used to post-train medical Vision-Language Models (VLMs), yet it remains unclear whether RL improves medical visual reasoning or mainly sharpens behaviors already induced by supervised fine-tuning (SFT). We present a controlled study that disentangles these effects along three axes: vision, SFT, and RL. Using MedMNIST as a multi-modality testbed, we probe visual perception by benchmarking VLM vision towers against vision-only baselines, quantify reasoning support and sampling efficiency via Accuracy@1 versus Pass@K, and evaluate when RL closes the support gap and how gains transfer across modalities. We find that RL is most effective when the model already has non-trivial support (high Pass@K): it primarily sharpens the output distribution, improving Acc@1 and sampling efficiency, while SFT expands support and makes RL effective. Based on these findings, we propose a boundary-aware recipe and instantiate it by RL post-training an OctoMed-initialized model on a small, balanced subset of PMC multiple-choice VQA, achieving strong average performance across six medical VQA benchmarks.
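
To make the Acc@1 versus Pass@K comparison concrete, the sketch below computes both metrics from per-question sample counts, assuming the standard unbiased Pass@K estimator (Chen et al., 2021). This is a minimal illustration, not the paper's code; the names `pass_at_k`, `support_gap`, and `correct_counts` are assumptions.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased Pass@K estimator: probability that at least one of k
    # samples, drawn without replacement from n generations of which
    # c are correct, answers the question correctly.
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def support_gap(correct_counts: list[int], n: int, k: int):
    # correct_counts[i] = number of correct answers among the n samples
    # drawn for question i. Acc@1 is the expected single-sample
    # accuracy; Pass@K credits a question if any of K samples is correct.
    acc1 = sum(c / n for c in correct_counts) / len(correct_counts)
    passk = sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
    # A large Pass@K - Acc@1 gap means the model has support (it can
    # produce the right answer) but samples it inefficiently -- the
    # regime in which, per the abstract, RL mainly sharpens the
    # output distribution.
    return acc1, passk, passk - acc1

# Example: 3 questions, 16 samples each, with 2, 0, and 12 correct.
print(support_gap([2, 0, 12], n=16, k=8))
```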