When Does RL Help Medical VLMs? Disentangling Vision, SFT, and RL Gains
March 1, 2026
Authors: Ahmadreza Jeddi, Kimia Shaban, Negin Baghbanzadeh, Natasha Sharan, Abhishek Moturu, Elham Dolatabadi, Babak Taati
cs.AI
Abstract
Reinforcement learning (RL) is increasingly used to post-train medical Vision-Language Models (VLMs), yet it remains unclear whether RL improves medical visual reasoning or mainly sharpens behaviors already induced by supervised fine-tuning (SFT). We present a controlled study that disentangles these effects along three axes: vision, SFT, and RL. Using MedMNIST as a multi-modality testbed, we probe visual perception by benchmarking VLM vision towers against vision-only baselines, quantify reasoning support and sampling efficiency via Accuracy@1 versus Pass@K, and evaluate when RL closes the support gap and how gains transfer across modalities. We find that RL is most effective when the model already has non-trivial support (high Pass@K): it primarily sharpens the output distribution, improving Acc@1 and sampling efficiency, while SFT expands support and makes RL effective. Based on these findings, we propose a boundary-aware recipe and instantiate it by RL post-training an OctoMed-initialized model on a small, balanced subset of PMC multiple-choice VQA, achieving strong average performance across six medical VQA benchmarks.
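
To make the Acc@1 versus Pass@K comparison concrete, the sketch below computes both metrics from per-question sample counts, assuming the standard unbiased Pass@K estimator (Chen et al., 2021). This is a minimal illustration, not the paper's code; the names `pass_at_k`, `support_gap`, and `correct_counts` are assumptions.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased Pass@K estimator: probability that at least one of k
    # samples, drawn without replacement from n generations of which
    # c are correct, answers the question correctly.
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def support_gap(correct_counts: list[int], n: int, k: int):
    # correct_counts[i] = number of correct answers among the n samples
    # drawn for question i. Acc@1 is the expected single-sample
    # accuracy; Pass@K credits a question if any of K samples is correct.
    acc1 = sum(c / n for c in correct_counts) / len(correct_counts)
    passk = sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
    # A large Pass@K - Acc@1 gap means the model has support (it can
    # produce the right answer) but samples it inefficiently -- the
    # regime in which, per the abstract, RL mainly sharpens the
    # output distribution.
    return acc1, passk, passk - acc1

# Example: 3 questions, 16 samples each, with 2, 0, and 12 correct.
print(support_gap([2, 0, 12], n=16, k=8))
```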