Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme

Abstract

Reinforcement learning (RL) has recently shown strong potential for improving the reasoning capabilities of large language models and is now being actively extended to vision-language models (VLMs). However, existing RL applications in VLMs often rely on heavily engineered frameworks that hinder reproducibility and accessibility, and they lack standardized evaluation protocols, which makes it difficult to compare results or interpret training dynamics. This work introduces a transparent, from-scratch framework for RL in VLMs, offering a minimal yet functional four-step pipeline validated across multiple models and datasets. In addition, a standardized evaluation scheme is proposed to assess training dynamics and reflective behaviors. Extensive experiments on visual reasoning tasks uncover key empirical findings: response length is sensitive to random seeds, reflection correlates with output length, and RL consistently outperforms supervised fine-tuning (SFT) in generalization, even with high-quality data. These findings, together with the proposed framework, aim to establish a reproducible baseline and support broader engagement in RL-based VLM research.
