SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis

Abstract

Vision-language models (VLMs) trained via reinforcement learning with verifiable rewards (RLVR) have shown notable progress in scaling test-time compute effectively. In this work, we investigate how synthesized RL data can further improve RLVR. To this end, we propose SynthRL, a scalable and guaranteed pipeline for automatic data scaling in reasoning-oriented RL training. SynthRL comprises three key stages: (1) selecting seed questions with an appropriate distribution, (2) augmenting them into more challenging variants while preserving the original answers, and (3) a guaranteed verification stage that ensures near-perfect correctness and increased difficulty. Our experiments demonstrate SynthRL's scalability and effectiveness. When applied to the MMK12 dataset, SynthRL synthesizes over 3.3K additional verifiable, challenging questions from approximately 8K seed samples. Models trained with our synthesized data achieve consistent gains across five out-of-domain visual math reasoning benchmarks, significantly outperforming baseline models trained on seed data alone. Notably, detailed analysis reveals that the gains are more pronounced on the most challenging evaluation samples, highlighting SynthRL's effectiveness in eliciting deeper and more complex reasoning patterns.
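The three stages named in the abstract can be illustrated with a minimal sketch. The abstract does not specify how seeds are selected or how correctness and difficulty are verified, so everything concrete here is an assumption made for illustration: the names `Question`, `pass_rate`, and `synthesize`, the hypothetical `rewrite` and `solve` callables, and the choice of a rollout pass rate as a stand-in for difficulty. It should not be read as the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Question:
    text: str
    answer: str  # reference answer used for verification


def pass_rate(solve: Callable[[str], str], q: Question, n_rollouts: int) -> float:
    """Fraction of n_rollouts attempts whose output equals the reference answer."""
    hits = sum(solve(q.text) == q.answer for _ in range(n_rollouts))
    return hits / n_rollouts


def synthesize(
    seeds: List[Question],
    rewrite: Callable[[str], str],  # hypothetical: produces a harder question text
    solve: Callable[[str], str],    # hypothetical: one attempt by a solver model
    n_rollouts: int = 8,
    min_seed_rate: float = 0.5,     # assumed selection threshold
) -> List[Question]:
    kept = []
    for seed in seeds:
        # Stage 1 (assumed criterion): keep seeds the solver handles well enough.
        seed_rate = pass_rate(solve, seed, n_rollouts)
        if seed_rate < min_seed_rate:
            continue
        # Stage 2: rewrite the question text, keep the original answer.
        variant = Question(rewrite(seed.text), seed.answer)
        # Stage 3 (assumed check): the original answer is still reachable
        # (rate > 0) and the variant is harder (rate below the seed's rate).
        variant_rate = pass_rate(solve, variant, n_rollouts)
        if 0 < variant_rate < seed_rate:
            kept.append(variant)
    return kept
```

In this sketch a variant that the solver never answers correctly is discarded, because its preserved answer can no longer be confirmed, and a variant that is no harder than its seed is discarded as well.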
