VeriEvol: Scaling Multimodal Mathematical Reasoning with Verifiable Evol-Instruct

Abstract

Scaling reinforcement learning for visual mathematical reasoning requires more than generating harder questions: as data volume grows, the reward labels themselves must remain reliable. Yet existing data pipelines scale supervision while trusting the labeller, and policy-side methods assume the underlying answers are already correct. We instead treat scaling as a verifiable data-construction problem and decouple two axes before any policy update: prompt difficulty, expanded by route-specific evolution operators, and answer reliability, enforced by offline hypothesis-test falsification. We instantiate this as VeriEvol, an iterative framework with two extensible components: a type-aware evolution module that rewrites low-difficulty image-question seeds into harder, image-grounded prompts; and HTV-Agent, a verifier that accepts an answer only after multi-source counter-evidence has failed to refute it. The resulting verified data scales in volume, extends by adding evolution routes or verifier channels, and plugs directly into existing GRPO-style RL recipes. On a five-benchmark visual-math suite, scaling evolved SFT data from 10K to 250K samples raises mean accuracy from 35.42 to 54.73. Then, with the backbone, SFT initialization, and GRPO recipe held fixed, VeriEvol adds a cumulative +3.88 points over an un-evolved RL baseline, of which +1.82 points come from evolved prompts and +2.06 points from the HTV-Agent verifier. We release the prompts, data, models, code, and the full verifier trace of every sample, so that downstream work can scale and audit the pipeline rather than only inspect its outputs.
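The evolve-then-falsify loop described in the abstract can be illustrated with a minimal toy sketch. It is not the released VeriEvol code: the names `Sample`, `accept`, `build_verified_set`, `add_term_route`, `stale_label_route`, and `recompute_channel` are hypothetical, and the "problems" are plain integer sums standing in for image-grounded questions. The sketch only shows the acceptance rule (keep an answer if no verifier channel refutes it) and how verified samples seed the next round.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Sample:
    image_id: str            # placeholder for the grounding image (unused in this toy)
    terms: Tuple[int, ...]   # toy "question": compute the sum of these terms
    answer: int              # claimed label, which may be wrong
    difficulty: int = 0      # number of evolution steps applied


# Hypothetical signatures: a route rewrites a sample into a harder one;
# a channel returns True when it finds counter-evidence against the answer.
EvolveFn = Callable[[Sample], Sample]
RefuteFn = Callable[[Sample], bool]


def accept(sample: Sample, channels: Iterable[RefuteFn]) -> bool:
    """Accept only if every channel fails to refute the claimed answer."""
    return not any(refutes(sample) for refutes in channels)


def build_verified_set(seeds: Iterable[Sample], routes: List[EvolveFn],
                       channels: List[RefuteFn], rounds: int = 1) -> List[Sample]:
    """Iteratively evolve the pool, keep unrefuted samples, and reuse them as seeds."""
    pool, verified = list(seeds), []
    for _ in range(rounds):
        evolved = [route(s) for s in pool for route in routes]
        kept = [s for s in evolved if accept(s, channels)]
        verified.extend(kept)
        pool = kept
    return verified


def add_term_route(s: Sample) -> Sample:
    """Harder prompt with a correctly updated label."""
    t = len(s.terms) + 1
    return replace(s, terms=s.terms + (t,), answer=s.answer + t,
                   difficulty=s.difficulty + 1)


def stale_label_route(s: Sample) -> Sample:
    """Harder prompt whose label was not updated (simulated labeller error)."""
    t = len(s.terms) + 1
    return replace(s, terms=s.terms + (t,), difficulty=s.difficulty + 1)


def recompute_channel(s: Sample) -> bool:
    """Refutes the answer when an independent recomputation disagrees."""
    return sum(s.terms) != s.answer


seeds = [Sample("img0", (1, 2), 3)]
verified = build_verified_set(seeds, [add_term_route, stale_label_route],
                              [recompute_channel], rounds=2)
for s in verified:
    print(s.difficulty, s.terms, s.answer)
```

In this toy run the stale-label samples are rejected in each round, so only correctly labelled samples of increasing difficulty are kept. Adding an evolution route or a verifier channel amounts to appending a function to the corresponding list, which mirrors the two extension points the abstract names.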