When Does RL Help Medical VLMs? Disentangling Vision, SFT, and RL Gains

Abstract

Reinforcement learning (RL) is increasingly used to post-train medical Vision-Language Models (VLMs), yet it remains unclear whether RL improves medical visual reasoning or mainly sharpens behaviors already induced by supervised fine-tuning (SFT). We present a controlled study that disentangles these effects along three axes: vision, SFT, and RL. Using MedMNIST as a multi-modality testbed, we probe visual perception by benchmarking VLM vision towers against vision-only baselines, quantify reasoning support and sampling efficiency via Accuracy@1 versus Pass@K, and evaluate when RL closes the support gap and how gains transfer across modalities. We find that RL is most effective when the model already has non-trivial support (high Pass@K): it primarily sharpens the output distribution, improving Acc@1 and sampling efficiency, while SFT expands support and makes RL effective. Based on these findings, we propose a boundary-aware recipe and instantiate it by RL post-training an OctoMed-initialized model on a small, balanced subset of PMC multiple-choice VQA, achieving strong average performance across six medical VQA benchmarks.
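The abstract does not state how Accuracy@1 and Pass@K are estimated. As a minimal sketch, assuming the widely used unbiased Pass@K estimator (draw n samples per question, count the c correct ones) and taking Acc@1 as the expected accuracy of a single sample, the gap between the two metrics could be computed as follows. The function names `pass_at_k` and `acc_at_1` are hypothetical and are not taken from the paper, which may, for example, use greedy decoding for Acc@1.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@K from n samples of which c are correct.

    Assumed estimator: 1 - C(n - c, k) / C(n, k), i.e. one minus the
    probability that k samples drawn without replacement are all wrong.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k wrong samples exist, so any k-subset contains a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def acc_at_1(n: int, c: int) -> float:
    """Expected single-sample accuracy; equals pass_at_k(n, c, 1)."""
    if n <= 0 or not 0 <= c <= n:
        raise ValueError("need n > 0 and 0 <= c <= n")
    return c / n


if __name__ == "__main__":
    # Illustrative numbers only: 8 samples for one question, 2 correct.
    n, c = 8, 2
    print(f"Acc@1  = {acc_at_1(n, c):.3f}")
    print(f"Pass@4 = {pass_at_k(n, c, 4):.3f}")
```

Under this reading, a question with high Pass@K but low Acc@1 is one the model can already answer but does not answer reliably, which is the regime where the abstract reports RL to be most effective.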
