VeriEvol: Scaling Multimodal Mathematical Reasoning via Verifiable Evol-Instruct

Abstract

Scaling reinforcement learning for visual mathematical reasoning requires more than generating harder questions: as data volume grows, the reward labels themselves must remain reliable. Yet existing data pipelines scale supervision while trusting the labeller, and policy-side methods assume the underlying answers are already correct. We instead treat scaling as a verifiable data-construction problem and decouple two axes before any policy update: prompt difficulty, expanded by route-specific evolution operators, and answer reliability, enforced by offline hypothesis-test falsification. We instantiate this as VeriEvol, an iterative framework with two extensible components: a type-aware evolution module that rewrites low-difficulty image-question seeds into harder, image-grounded prompts; and HTV-Agent, a verifier that accepts an answer only after multi-source counter-evidence has failed to refute it. The resulting verified data scales in volume, extends by adding evolution routes or verifier channels, and plugs directly into existing GRPO-style RL recipes. On a five-benchmark visual-math suite, scaling evolved SFT data from 10K to 250K samples raises the mean accuracy from 35.42 to 54.73. Then, with the backbone, SFT initialization, and GRPO recipe held fixed, VeriEvol adds a cumulative +3.88 over an un-evolved RL baseline, of which +1.82 comes from evolved prompts and +2.06 from the HTV-Agent verifier. We release the prompts, data, models, code, and the full verifier trace of every sample, so that downstream work can scale and audit the pipeline rather than only inspect its outputs.
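The acceptance rule described for the verifier (keep an answer only when every source of counter-evidence fails to refute it, and record a trace for auditing) can be illustrated with a minimal sketch. This is not the HTV-Agent implementation: the names `Channel`, `Verdict`, `verify`, `format_channel`, and `recompute_channel` are hypothetical, and the toy channels handle only text questions of the form "a + b" rather than image-grounded problems.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical interface: a channel inspects (question, answer) and returns
# a counter-evidence message, or None if it could not refute the answer.
Channel = Callable[[str, str], Optional[str]]


@dataclass
class Verdict:
    accepted: bool
    # One (channel name, counter-evidence or None) record per channel.
    trace: List[Tuple[str, Optional[str]]] = field(default_factory=list)


def verify(question: str, answer: str,
           channels: List[Tuple[str, Channel]]) -> Verdict:
    """Accept only if at least one channel ran and none refuted the answer."""
    trace = [(name, channel(question, answer)) for name, channel in channels]
    accepted = bool(trace) and all(evidence is None for _, evidence in trace)
    return Verdict(accepted=accepted, trace=trace)


def format_channel(question: str, answer: str) -> Optional[str]:
    """Toy channel: refute answers that are not numeric."""
    try:
        float(answer)
    except ValueError:
        return f"answer {answer!r} is not numeric"
    return None


def recompute_channel(question: str, answer: str) -> Optional[str]:
    """Toy channel: independently re-solve 'a + b' and compare."""
    left, right = question.split("+")
    expected = float(left) + float(right)
    try:
        got = float(answer)
    except ValueError:
        return "cannot compare a non-numeric answer"
    if got != expected:
        return f"independent recomputation gives {expected}, not {got}"
    return None


# Assumed channel set for illustration; adding a channel extends the verifier.
CHANNELS = [("format", format_channel), ("recompute", recompute_channel)]

if __name__ == "__main__":
    for candidate in ("5", "6", "five"):
        verdict = verify("2 + 3", candidate, CHANNELS)
        print(candidate, verdict.accepted, verdict.trace)
```

In this sketch, adding a verifier channel means appending one more entry to `CHANNELS`, and the per-sample `trace` is what an auditor would inspect; both are illustrative design choices rather than details taken from the paper.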