VeriEvol: Scaling Multimodal Mathematical Reasoning via Verifiable Evol-Instruct
June 22, 2026
Authors: Haoling Li, Kai Zheng, Jie Wu, Can Xu, Qingfeng Sun, Han Hu, Yujiu Yang
cs.AI
Abstract
Scaling reinforcement learning for visual mathematical reasoning requires more than generating harder questions: as data volume grows, the reward labels themselves must remain reliable. Yet existing data pipelines scale supervision while trusting the labeller, and policy-side methods assume the underlying answers are already correct. We instead treat scaling as a verifiable data-construction problem and decouple two axes before any policy update: prompt difficulty, expanded by route-specific evolution operators, and answer reliability, enforced by offline hypothesis-test falsification. We instantiate this as VeriEvol, an iterative framework with two extensible components: a type-aware evolution module that rewrites low-difficulty image-question seeds into harder, image-grounded prompts; and HTV-Agent, a verifier that accepts an answer only after multi-source counter-evidence has failed to refute it. The resulting verified data scales in volume, extends by adding evolution routes or verifier channels, and plugs directly into existing GRPO-style RL recipes. On a five-benchmark visual-math suite, scaling evolved SFT data from 10K to 250K samples raises the mean accuracy from 35.42 to 54.73; then, with backbone, SFT initialization, and GRPO recipe held fixed, VeriEvol adds a cumulative +3.88 over an un-evolved RL baseline, of which +1.82 comes from evolved prompts and +2.06 from the HTV-Agent verifier. We release the prompts, data, models, code, and the full verifier trace of every sample, so that downstream work can scale and audit the pipeline rather than only inspect its outputs.
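To make the acceptance rule concrete, the following is a minimal sketch of falsification-style verification: an answer is accepted only when every channel has tried and failed to refute it. It is an illustration under stated assumptions, not the released HTV-Agent. The names `Channel`, `Verdict`, `verify_by_falsification`, and both toy channels are hypothetical, and the toy questions are plain text sums rather than image-grounded problems.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical interface: a channel inspects (question, answer) and returns
# a counter-evidence string if it can refute the answer, otherwise None.
Channel = Callable[[str, str], Optional[str]]


@dataclass
class Verdict:
    accepted: bool
    refutations: List[str]  # kept as a trace so each decision can be audited


def verify_by_falsification(question: str, answer: str,
                            channels: List[Channel]) -> Verdict:
    """Accept the answer only if no channel produces counter-evidence."""
    if not channels:
        # With no channels nothing was tested, so acceptance would be vacuous.
        raise ValueError("at least one verifier channel is required")
    refutations = []
    for channel in channels:
        evidence = channel(question, answer)
        if evidence is not None:
            refutations.append(evidence)
    return Verdict(accepted=not refutations, refutations=refutations)


def recompute_channel(question: str, answer: str) -> Optional[str]:
    """Toy channel: recompute questions of the form 'a+b' and compare."""
    left, right = question.split("+")
    expected = int(left) + int(right)
    if str(expected) == answer.strip():
        return None
    return f"recomputed {expected}, but the answer is {answer!r}"


def format_channel(question: str, answer: str) -> Optional[str]:
    """Toy channel: refute answers that are not plain integers."""
    if answer.strip().lstrip("-").isdigit():
        return None
    return f"answer {answer!r} is not an integer"


CHANNELS = [recompute_channel, format_channel]
```

In this sketch, `verify_by_falsification("2+3", "5", CHANNELS)` is accepted because neither channel finds counter-evidence, while `verify_by_falsification("2+3", "6", CHANNELS)` is rejected with one recorded refutation. Adding a verifier channel amounts to appending another callable to the list, which mirrors the extensibility the abstract describes.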