When Does RL Help Medical VLMs? Disentangling Vision, SFT, and RL Gains
March 1, 2026
Authors: Ahmadreza Jeddi, Kimia Shaban, Negin Baghbanzadeh, Natasha Sharan, Abhishek Moturu, Elham Dolatabadi, Babak Taati
cs.AI
Abstract
Reinforcement learning (RL) is increasingly used to post-train medical Vision-Language Models (VLMs), yet it remains unclear whether RL improves medical visual reasoning or mainly sharpens behaviors already induced by supervised fine-tuning (SFT). We present a controlled study that disentangles these effects along three axes: vision, SFT, and RL. Using MedMNIST as a multi-modality testbed, we probe visual perception by benchmarking VLM vision towers against vision-only baselines, quantify reasoning support and sampling efficiency via Accuracy@1 versus Pass@K, and evaluate when RL closes the support gap and how gains transfer across modalities. We find that RL is most effective when the model already has non-trivial support (high Pass@K): it primarily sharpens the output distribution, improving Acc@1 and sampling efficiency, while SFT expands support and makes RL effective. Based on these findings, we propose a boundary-aware recipe and instantiate it by RL post-training an OctoMed-initialized model on a small, balanced subset of PMC multiple-choice VQA, achieving strong average performance across six medical VQA benchmarks.
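The abstract's core diagnostic is the gap between Accuracy@1 (the single-sample answer is correct) and Pass@K (at least one of K sampled answers is correct), which separates "support" from "sharpening". As a minimal sketch, the standard unbiased Pass@K estimator (as used in Chen et al.'s HumanEval evaluation) can be computed as below; the abstract does not specify the paper's sample counts or decoding settings, so the values of `n`, `c`, and `k` here are purely illustrative.

```python
import numpy as np

def acc_at_1(correct_first_sample):
    """Accuracy@1: fraction of questions whose single (first) sampled
    answer is correct. `correct_first_sample` is a boolean array with
    one entry per question."""
    return float(np.mean(correct_first_sample))

def pass_at_k(n, c, k):
    """Unbiased Pass@K estimator: probability that at least one of k
    draws (without replacement) from n samples, of which c are correct,
    contains a correct answer. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Every possible draw of k samples must include a correct one.
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Illustrative numbers: 16 samples per question, 3 correct.
# Pass@8 is high (~0.9) even if Acc@1 is low, i.e. the model has
# non-trivial support that sharpening could exploit.
print(pass_at_k(n=16, c=3, k=8))
```

In the paper's framing, RL "sharpening" raises Acc@1 toward the Pass@K ceiling without necessarily expanding that ceiling, whereas SFT expands Pass@K itself, which is why high Pass@K marks the regime where RL post-training pays off.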