Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme

April 3, 2025
作者: Yan Ma, Steffi Chern, Xuyang Shen, Yiran Zhong, Pengfei Liu
cs.AI

Abstract

Reinforcement learning (RL) has recently shown strong potential in improving the reasoning capabilities of large language models and is now being actively extended to vision-language models (VLMs). However, existing RL applications in VLMs often rely on heavily engineered frameworks that hinder reproducibility and accessibility, while lacking standardized evaluation protocols, making it difficult to compare results or interpret training dynamics. This work introduces a transparent, from-scratch framework for RL in VLMs, offering a minimal yet functional four-step pipeline validated across multiple models and datasets. In addition, a standardized evaluation scheme is proposed to assess training dynamics and reflective behaviors. Extensive experiments on visual reasoning tasks uncover key empirical findings: response length is sensitive to random seeds, reflection correlates with output length, and RL consistently outperforms supervised fine-tuning (SFT) in generalization, even with high-quality data. These findings, together with the proposed framework, aim to establish a reproducible baseline and support broader engagement in RL-based VLM research.
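
To make the "minimal yet functional four-step pipeline" concrete, below is a small, self-contained sketch of what a rollout → reward → advantage → policy-update loop for a VLM could look like, together with logging of the two training-dynamics signals the abstract highlights (response length and reflective phrasing). Everything in this sketch is an illustrative assumption: the step names, the exact-match reward, the toy policy, and the reflection keyword list are hypothetical and are not taken from the paper's implementation.

```python
"""
Hypothetical sketch of a minimal four-step RL loop for a VLM.
Function names, the reward rule, the toy policy, and the reflection
keywords are illustrative assumptions, not the paper's implementation.
"""
import random

REFLECTION_WORDS = ("wait", "re-check", "let me verify", "on second thought")

def rollout(policy, batch):
    """Step 1: sample one response per (image, question) pair from the current policy."""
    return [policy(sample) for sample in batch]

def compute_rewards(responses, answers):
    """Step 2: rule-based reward, e.g. 1.0 if the reference answer appears in the response."""
    return [1.0 if ans in resp else 0.0 for resp, ans in zip(responses, answers)]

def estimate_advantages(rewards):
    """Step 3: turn rewards into advantages (simple mean-centering stand-in)."""
    mean_r = sum(rewards) / len(rewards)
    return [r - mean_r for r in rewards]

def update_policy(policy, responses, advantages):
    """Step 4: apply a policy-gradient update; a no-op placeholder here."""
    return policy

def log_dynamics(step, responses):
    """Log the two signals the abstract highlights: response length and reflection."""
    lengths = [len(r.split()) for r in responses]
    reflective = sum(any(w in r.lower() for w in REFLECTION_WORDS) for r in responses)
    print(f"step {step}: mean_len={sum(lengths) / len(lengths):.1f} words, "
          f"reflective={reflective}/{len(responses)}")

def toy_policy(sample):
    """Placeholder VLM: emits a canned answer, sometimes prefixed with a reflective phrase."""
    prefix = "Wait, let me verify. " if random.random() < 0.3 else ""
    return prefix + f"The answer is {random.choice(['A', 'B', 'C'])}."

if __name__ == "__main__":
    random.seed(0)
    batch = [{"image": None, "question": f"q{i}"} for i in range(8)]
    answers = ["A"] * 8
    policy = toy_policy
    for step in range(3):
        responses = rollout(policy, batch)                      # step 1: rollout
        rewards = compute_rewards(responses, answers)           # step 2: reward
        advantages = estimate_advantages(rewards)               # step 3: advantage
        policy = update_policy(policy, responses, advantages)   # step 4: update
        log_dynamics(step, responses)
```

In a real setup the toy policy would be replaced by a VLM generating responses conditioned on the image and question, and the no-op update by an actual policy-gradient step; the logging is the part that maps most directly onto the evaluation scheme described above, since it tracks response length and the frequency of reflective phrases across training steps.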
