

RL makes MLLMs see better than SFT

October 18, 2025
Authors: Junha Song, Sangdoo Yun, Dongyoon Han, Jaegul Choo, Byeongho Heo
cs.AI

Abstract

A dominant assumption in Multimodal Language Model (MLLM) research is that an MLLM's performance is largely inherited from the LLM backbone, given the backbone's immense parameter scale and remarkable capabilities. This assumption has created a void in our understanding of the vision encoder, which determines how MLLMs perceive images. The recent shift in MLLM training paradigms, from Supervised Finetuning (SFT) to Reinforcement Learning (RL), magnifies this oversight: there is a significant lack of analysis on how such training reshapes the vision encoder as well as the MLLM itself. To address this, we first investigate the impact of training strategies on MLLMs, where RL shows a clear advantage over SFT on strongly vision-related VQA benchmarks. Motivated by this, we conduct a critical yet under-explored analysis of the vision encoder of MLLMs through diverse and in-depth experiments, ranging from ImageNet classification and segmentation to gradient visualization. Our results demonstrate that an MLLM's post-training strategy (i.e., SFT or RL) not only leads to distinct outcomes on MLLM downstream tasks, but also fundamentally reshapes the MLLM's underlying visual representations. Specifically, the key finding of our study is that RL produces stronger and more precisely localized visual representations than SFT, boosting the vision encoder's capability within the MLLM. We then distill our findings into a simple recipe for building strong vision encoders for MLLMs, Preference-Instructed Vision OpTimization (PIVOT). When integrated into MLLMs, a PIVOT-trained vision encoder outperforms even larger and more heavily trained counterparts, despite requiring less than 1% of the computational cost of standard vision pretraining. This result opens an effective and efficient path for advancing the vision backbones of MLLMs. Project page available at https://june-page.github.io/pivot/.
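The kind of encoder analysis the abstract mentions (gradient visualization of what the vision encoder attends to) can be illustrated with a minimal, hypothetical sketch. This is not the paper's code: it assumes torchvision's ViT-B/16 as a stand-in for an MLLM vision encoder and uses plain input-gradient saliency; the function name and setup are illustrative only.

# Hypothetical sketch: input-gradient saliency for a frozen vision encoder.
# Assumes torchvision >= 0.13 and a PIL image as input; not the paper's code.
import torch
from torchvision.models import vit_b_16, ViT_B_16_Weights

weights = ViT_B_16_Weights.IMAGENET1K_V1
model = vit_b_16(weights=weights).eval()   # stand-in for an MLLM vision encoder
preprocess = weights.transforms()

def input_gradient_saliency(image, target_class):
    # 'image' is a PIL image; 'target_class' is an ImageNet class index.
    x = preprocess(image).unsqueeze(0).requires_grad_(True)
    logits = model(x)                       # shape [1, 1000]
    logits[0, target_class].backward()      # gradient of one logit w.r.t. pixels
    # Collapse RGB channels into a single per-pixel saliency map.
    return x.grad.detach().abs().max(dim=1).values.squeeze(0)   # shape [224, 224]

Comparing such saliency maps (or ImageNet linear-probe accuracy) between SFT- and RL-post-trained checkpoints of the same encoder is the style of experiment the abstract alludes to when contrasting how the two strategies reshape visual representations.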