ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models
October 12, 2025
Authors: Yuqi Liu, Liangyu Chen, Jiazhen Liu, Mingkang Zhu, Zhisheng Zhong, Bei Yu, Jiaya Jia
cs.AI
Abstract
Typical post-training paradigms for Large Vision-and-Language Models (LVLMs)
include Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable
Rewards (RLVR). SFT leverages external guidance to inject new knowledge,
whereas RLVR utilizes internal reinforcement to enhance reasoning capabilities
and overall performance. However, our analysis reveals that SFT often leads to
sub-optimal performance, while RLVR struggles with tasks that exceed the
model's internal knowledge base. To address these limitations, we propose
ViSurf (Visual Supervised-and-Reinforcement
Fine-Tuning), a unified post-training paradigm that integrates the
strengths of both SFT and RLVR within a single stage. We analyze the derivation
of the SFT and RLVR objectives to establish the ViSurf objective, providing a
unified perspective on these two paradigms. The core of ViSurf involves
injecting ground-truth labels into the RLVR rollouts, thereby providing
simultaneous external supervision and internal reinforcement. Furthermore, we
introduce three novel reward control strategies to stabilize and optimize the
training process. Extensive experiments across several diverse benchmarks
demonstrate the effectiveness of ViSurf, which outperforms SFT alone, RLVR
alone, and the two-stage SFT → RLVR pipeline. In-depth analysis corroborates
these findings, validating the derivation and design principles of ViSurf.
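
The idea of injecting ground-truth labels into the RLVR rollouts can be illustrated with a small sketch. The abstract does not give the ViSurf objective, so the code below assumes a GRPO-style group-normalized advantage; the function names, the toy rewards, and the choice of reward 1.0 for the ground-truth sample are all hypothetical, and the three reward control strategies are not modeled.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps).

    Assumption: a GRPO-style normalization, not taken from the abstract.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def inject_ground_truth(rollouts, rewards, ground_truth, gt_reward=1.0):
    """Append the ground-truth label to the sampled group as one extra rollout.

    Assumption: the ground truth receives a fixed reward `gt_reward`.
    """
    return rollouts + [ground_truth], rewards + [gt_reward]


# Hypothetical task beyond the model's knowledge: every sampled rollout fails.
rollouts = ["answer A", "answer B", "answer C"]
rewards = [0.0, 0.0, 0.0]

# RLVR alone: identical rewards give zero advantages, hence no learning signal.
rl_only = group_advantages(rewards)

# With the ground truth injected, the group contains a positive example.
mixed_rollouts, mixed_rewards = inject_ground_truth(rollouts, rewards, "ground truth")
mixed = group_advantages(mixed_rewards)

print(rl_only)
print(mixed)
```

In this toy setting the plain RLVR group yields all-zero advantages, while the group with the injected label assigns a positive advantage to the ground truth and negative advantages to the failed rollouts. This is one way to read the abstract's claim that external supervision and internal reinforcement are provided simultaneously.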