SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis
June 2, 2025
Authors: Zijian Wu, Jinjie Ni, Xiangyan Liu, Zichen Liu, Hang Yan, Michael Qizhe Shieh
cs.AI
Abstract
Vision-language models (VLMs) trained via reinforcement learning with
verifiable reward (RLVR) have shown notable progress in scaling test-time
compute effectively. In this work, we investigate how synthesized RL data can
further improve RLVR. To this end, we propose SynthRL, a scalable and
guaranteed pipeline for automatic data scaling in reasoning-oriented RL
training. SynthRL comprises three key stages: (1) selecting seed questions with
appropriate distribution, (2) augmenting them into more challenging variants
while preserving the original answers, and (3) a guaranteed verification stage
that ensures near-perfect correctness and difficulty enhancement. Our empirical
experiments demonstrate SynthRL's scalability and effectiveness. When applied
to the MMK12 dataset, SynthRL synthesizes over 3.3K additional verifiable,
challenging questions from approximately 8K seed samples. Models trained with
our synthesized data achieve consistent gains across five out-of-domain visual
math reasoning benchmarks, with a significant improvement over baseline models
trained on seed data alone. Notably, detailed analysis reveals that the gains
are more pronounced on the most challenging evaluation samples, highlighting
SynthRL's effectiveness in eliciting deeper and more complex reasoning
patterns.
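The three-stage pipeline described in the abstract can be sketched in code. This is a minimal, hypothetical illustration: the function names, the difficulty band used for seed selection, the placeholder augmenter, and the solver-based verification check are all assumptions for exposition, not the authors' implementation.

```python
def select_seeds(pool, min_difficulty=0.2, max_difficulty=0.8):
    """Stage 1: keep seed questions whose estimated difficulty falls in a
    target band, approximating 'appropriate distribution' (band is assumed)."""
    return [q for q in pool if min_difficulty <= q["difficulty"] <= max_difficulty]


def augment(question):
    """Stage 2: produce a harder variant while preserving the original answer.
    A placeholder string rewrite stands in for an LLM-based augmenter."""
    return {
        "prompt": question["prompt"] + " (with an added distracting condition)",
        "answer": question["answer"],  # the answer must be unchanged
        "difficulty": question["difficulty"] + 0.1,  # assumed difficulty bump
    }


def verify(seed, variant, solver):
    """Stage 3: accept the variant only if a solver still reaches the
    preserved answer AND the variant is measurably harder than the seed."""
    answer_preserved = solver(variant["prompt"]) == variant["answer"]
    harder = variant["difficulty"] > seed["difficulty"]
    return answer_preserved and harder


def synthesize(pool, solver):
    """Run the full pipeline: select seeds, augment each, keep verified variants."""
    verified = []
    for seed in select_seeds(pool):
        variant = augment(seed)
        if verify(seed, variant, solver):
            verified.append(variant)
    return verified
```

In this sketch, the verification stage is what provides the "guaranteed" property: a variant enters the training set only when answer preservation and increased difficulty are both confirmed, so the synthesized set can be smaller than the seed set (as with the ~3.3K questions synthesized from ~8K seeds).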