When Does RL Help Medical VLMs? Disentangling Vision, SFT, and RL Gains

Abstract

Reinforcement learning (RL) is increasingly used to post-train medical vision-language models (VLMs), yet it remains unclear whether RL actually improves medical visual reasoning or mainly sharpens behaviors already induced by supervised fine-tuning (SFT). We present a controlled study that disentangles these effects along three axes: vision, SFT, and RL. Using MedMNIST as a multi-modality testbed, we (i) probe visual perception by benchmarking VLM vision towers against vision-only baselines, (ii) quantify reasoning support and sampling efficiency by comparing Accuracy@1 (Acc@1) with Pass@K, and (iii) evaluate when RL closes the support gap and how its gains transfer across modalities. We find that RL is most effective when the model already has non-trivial support (high Pass@K): RL primarily sharpens the output distribution, improving Acc@1 and sampling efficiency, whereas SFT expands support and thereby makes RL effective. Based on these findings, we propose a boundary-aware recipe and instantiate it by RL post-training an OctoMed-initialized model on a small, balanced subset of PMC multiple-choice VQA, achieving strong average performance across six medical VQA benchmarks.
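The Accuracy@1 versus Pass@K comparison can be made concrete with a minimal sketch. The abstract does not state which estimator is used, so this assumes the standard unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), for n sampled answers per question of which c are correct, and it treats Acc@1 as the expected correctness of a single sample (c/n). The function names and the example counts are hypothetical and are not results from the study.

```python
from math import comb
from typing import Sequence, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k samples is correct),
    given that c of n sampled answers were correct (assumed estimator)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def summarize(correct_counts: Sequence[int], n: int, k: int) -> Tuple[float, float, float]:
    """Return (Acc@1, Pass@K, gap) averaged over questions.

    correct_counts[i] is the number of correct answers among n samples
    for question i. Acc@1 here is the mean of c/n, which equals pass@1.
    """
    acc1 = sum(c / n for c in correct_counts) / len(correct_counts)
    passk = sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
    return acc1, passk, passk - acc1


# Hypothetical counts for four questions, 8 samples each.
acc1, passk, gap = summarize([8, 1, 0, 4], n=8, k=4)
print(f"Acc@1={acc1:.3f}  Pass@4={passk:.3f}  gap={gap:.3f}")
```

Under these assumptions, a large gap between Pass@K and Acc@1 indicates correct answers that lie within the model's support but are sampled rarely, which is the regime in which the abstract reports RL helping most. A question with zero correct samples contributes nothing to either metric, which corresponds to the case where support has to be expanded first.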
