SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis
June 2, 2025
Authors: Zijian Wu, Jinjie Ni, Xiangyan Liu, Zichen Liu, Hang Yan, Michael Qizhe Shieh
cs.AI
Abstract
Vision-language models (VLMs) trained via reinforcement learning with
verifiable reward (RLVR) have shown notable progress in scaling test-time
compute effectively. In this work, we investigate how synthesized RL data can
further improve RLVR. To this end, we propose SynthRL, a scalable and
guaranteed pipeline for automatic data scaling in reasoning-oriented RL
training. SynthRL comprises three key stages: (1) selecting seed questions with
appropriate distribution, (2) augmenting them into more challenging variants
while preserving the original answers, and (3) a guaranteed verification stage
that ensures near-perfect correctness and difficulty enhancement. Our empirical
experiments demonstrate SynthRL's scalability and effectiveness. When applied
to the MMK12 dataset, SynthRL synthesizes over 3.3K additional verifiable,
challenging questions from approximately 8K seed samples. Models trained with
our synthesized data achieve consistent gains across five out-of-domain visual
math reasoning benchmarks, with a significant improvement over baseline models
trained on seed data alone. Notably, detailed analysis reveals that the gains
are more pronounced on the most challenging evaluation samples, highlighting
SynthRL's effectiveness in eliciting deeper and more complex reasoning
patterns.
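The three-stage pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`select_seeds`, `augment`, `verify`, `synthesize`), the solve-rate thresholds, and the difficulty margin `eps` are all assumptions for exposition.

```python
# Hedged sketch of a SynthRL-style data-scaling pipeline.
# All names, data shapes, and thresholds are illustrative assumptions.

def select_seeds(pool, solve_rate, lo=0.2, hi=0.8):
    """Stage 1: keep seed questions whose solve rate suggests they are
    neither trivial nor unsolvable (threshold values are assumptions)."""
    return [q for q in pool if lo <= solve_rate(q) <= hi]

def augment(question, rewrite):
    """Stage 2: rewrite a seed into a harder variant while preserving
    the original verifiable answer."""
    variant = dict(question)
    variant["prompt"] = rewrite(question["prompt"])
    return variant

def verify(seed, variant, solve_rate, eps=0.05):
    """Stage 3: accept the variant only if its answer matches the seed's
    and it is measurably harder (lower solve rate) than the seed."""
    same_answer = variant["answer"] == seed["answer"]
    harder = solve_rate(variant) + eps < solve_rate(seed)
    return same_answer and harder

def synthesize(pool, solve_rate, rewrite):
    """Run all three stages; return seeds plus verified harder variants."""
    seeds = select_seeds(pool, solve_rate)
    extras = []
    for q in seeds:
        v = augment(q, rewrite)
        if verify(q, v, solve_rate):
            extras.append(v)
    return seeds + extras
```

In practice `rewrite` would be an LLM-based augmenter and `solve_rate` an empirical pass rate from model rollouts; here they are plain callables so the control flow of the pipeline stands on its own.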