When Does RL Help Medical VLMs? Disentangling Vision, SFT, and RL Gains
March 1, 2026
Authors: Ahmadreza Jeddi, Kimia Shaban, Negin Baghbanzadeh, Natasha Sharan, Abhishek Moturu, Elham Dolatabadi, Babak Taati
cs.AI
Abstract
Reinforcement learning (RL) is increasingly used to post-train medical Vision-Language Models (VLMs), yet it remains unclear whether RL improves medical visual reasoning or mainly sharpens behaviors already induced by supervised fine-tuning (SFT). We present a controlled study that disentangles these effects along three axes: vision, SFT, and RL. Using MedMNIST as a multi-modality testbed, we probe visual perception by benchmarking VLM vision towers against vision-only baselines, quantify reasoning support and sampling efficiency via Accuracy@1 versus Pass@K, and evaluate when RL closes the support gap and how gains transfer across modalities. We find that RL is most effective when the model already has non-trivial support (high Pass@K): it primarily sharpens the output distribution, improving Acc@1 and sampling efficiency, while SFT expands support and makes RL effective. Based on these findings, we propose a boundary-aware recipe and instantiate it by RL post-training an OctoMed-initialized model on a small, balanced subset of PMC multiple-choice VQA, achieving strong average performance across six medical VQA benchmarks.
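The abstract's core diagnostic is the gap between Accuracy@1 (the single-sample answer is correct) and Pass@K (at least one of K sampled answers is correct), which separates "support" from "sharpening". As a minimal sketch, the standard unbiased Pass@K estimator (as used in Chen et al.'s HumanEval evaluation) can be computed as below; the abstract does not specify the paper's sample counts or decoding settings, so the values of `n`, `c`, and `k` here are purely illustrative.

```python
import numpy as np

def acc_at_1(correct_first_sample):
    """Accuracy@1: fraction of questions whose single (first) sampled
    answer is correct. `correct_first_sample` is a boolean array with
    one entry per question."""
    return float(np.mean(correct_first_sample))

def pass_at_k(n, c, k):
    """Unbiased Pass@K estimator: probability that at least one of k
    draws (without replacement) from n samples, of which c are correct,
    contains a correct answer. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Every possible draw of k samples must include a correct one.
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Illustrative numbers: 16 samples per question, 3 correct.
# Pass@8 is high (~0.9) even if Acc@1 is low, i.e. the model has
# non-trivial support that sharpening could exploit.
print(pass_at_k(n=16, c=3, k=8))
```

In the paper's framing, RL "sharpening" raises Acc@1 toward the Pass@K ceiling without necessarily expanding that ceiling, whereas SFT expands Pass@K itself, which is why high Pass@K marks the regime where RL post-training pays off.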