SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis
June 2, 2025
Authors: Zijian Wu, Jinjie Ni, Xiangyan Liu, Zichen Liu, Hang Yan, Michael Qizhe Shieh
cs.AI
Abstract
Vision-language models (VLMs) trained via reinforcement learning with
verifiable reward (RLVR) have shown notable progress in scaling test-time
compute effectively. In this work, we investigate how synthesized RL data can
further improve RLVR. To this end, we propose SynthRL, a scalable and
guaranteed pipeline for automatic data scaling in reasoning-oriented RL
training. SynthRL comprises three key stages: (1) selecting seed questions with
appropriate distribution, (2) augmenting them into more challenging variants
while preserving the original answers, and (3) a guaranteed verification stage
that ensures near-perfect correctness and difficulty enhancement. Our empirical
experiments demonstrate SynthRL's scalability and effectiveness. When applied
to the MMK12 dataset, SynthRL synthesizes over 3.3K additional verifiable,
challenging questions from approximately 8K seed samples. Models trained with
our synthesized data achieve consistent gains across five out-of-domain visual
math reasoning benchmarks, with a significant improvement over baseline models
trained on seed data alone. Notably, detailed analysis reveals that the gains
are more pronounced on the most challenging evaluation samples, highlighting
SynthRL's effectiveness in eliciting deeper and more complex reasoning
patterns.
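The three-stage pipeline described in the abstract can be sketched in code. This is a minimal, hypothetical illustration: the function names, the difficulty band used for seed selection, the placeholder augmenter, and the solver-based verification check are all assumptions for exposition, not the authors' implementation.

```python
def select_seeds(pool, min_difficulty=0.2, max_difficulty=0.8):
    """Stage 1: keep seed questions whose estimated difficulty falls in a
    target band, approximating 'appropriate distribution' (band is assumed)."""
    return [q for q in pool if min_difficulty <= q["difficulty"] <= max_difficulty]


def augment(question):
    """Stage 2: produce a harder variant while preserving the original answer.
    A placeholder string rewrite stands in for an LLM-based augmenter."""
    return {
        "prompt": question["prompt"] + " (with an added distracting condition)",
        "answer": question["answer"],  # the answer must be unchanged
        "difficulty": question["difficulty"] + 0.1,  # assumed difficulty bump
    }


def verify(seed, variant, solver):
    """Stage 3: accept the variant only if a solver still reaches the
    preserved answer AND the variant is measurably harder than the seed."""
    answer_preserved = solver(variant["prompt"]) == variant["answer"]
    harder = variant["difficulty"] > seed["difficulty"]
    return answer_preserved and harder


def synthesize(pool, solver):
    """Run the full pipeline: select seeds, augment each, keep verified variants."""
    verified = []
    for seed in select_seeds(pool):
        variant = augment(seed)
        if verify(seed, variant, solver):
            verified.append(variant)
    return verified
```

In this sketch, the verification stage is what provides the "guaranteed" property: a variant enters the training set only when answer preservation and increased difficulty are both confirmed, so the synthesized set can be smaller than the seed set (as with the ~3.3K questions synthesized from ~8K seeds).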