REVES: Revise and Verify, Augmented Training for Test-Time Scaling

Abstract

Test-time scaling via sequential revision has emerged as a powerful paradigm for enhancing Large Language Model (LLM) reasoning. However, standard post-training methods primarily optimize single-shot objectives, creating a fundamental misalignment with multi-step inference dynamics. Recent work treats this as multi-turn reinforcement learning (RL), but conventional approaches optimize over the multi-step trajectories directly and so fail to exploit the high-quality mistakes in intermediate steps, which the model could learn from by correcting them. We propose a two-stage iterative framework that alternates between online data/prompt augmentation and policy optimization. By converting the intermediate steps ("near-miss" answers) of successful recovery trajectories into decoupled revision and verification prompts, our approach concentrates training on both effective answer transformation and error identification. It also enables efficient off-policy data generation and reduces the computational overhead of long-horizon sampling compared to standard multi-turn RL. On LiveCodeBench, using publicly available test cases as feedback, we observe gains of +6.5 points over the RL baseline and +4.0 points over standard multi-turn training. Beyond coding, our approach matches the previously reported SOTA result on circle packing while using the smallest base model (4B) and far fewer rollouts than much larger evolutionary search systems. Math results under ground-truth verification further confirm improved correction ability. The approach also generalizes to out-of-distribution constraint-satisfaction puzzles such as n_queens and mini_sudoku, where correctness is defined entirely by problem constraints. Code is available at https://github.com/yxliu02/REVES.git.
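The augmentation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Step` and `AugmentedPrompt` types, the `augment` function, the prompt wording, and the verdict labels are all hypothetical names chosen for illustration. The sketch assumes a trajectory is a list of attempts on one problem, each with feedback and a correctness flag, and that a "successful recovery trajectory" is one whose final attempt is correct.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    """One turn of a multi-step attempt (hypothetical structure)."""
    answer: str      # the model's answer at this turn
    feedback: str    # feedback received, e.g. public test results
    correct: bool    # whether this answer passed verification


@dataclass
class AugmentedPrompt:
    """A standalone training prompt derived from an intermediate step."""
    kind: str                        # "revision" or "verification"
    prompt: str                      # single-turn prompt text
    expected_verdict: Optional[str]  # set only for verification prompts


def augment(problem: str, trajectory: List[Step]) -> List[AugmentedPrompt]:
    """Turn the near-miss steps of a successful recovery trajectory into
    decoupled revision and verification prompts (illustrative only)."""
    # Keep only trajectories that eventually recover to a correct answer.
    if not trajectory or not trajectory[-1].correct:
        return []

    prompts: List[AugmentedPrompt] = []
    for step in trajectory[:-1]:
        if step.correct:
            continue  # only incorrect intermediate answers are near misses
        # Revision prompt: start directly from the near-miss answer.
        prompts.append(AugmentedPrompt(
            kind="revision",
            prompt=(f"Problem:\n{problem}\n\nPrevious answer:\n{step.answer}\n\n"
                    f"Feedback:\n{step.feedback}\n\nRevise the answer."),
            expected_verdict=None,
        ))
        # Verification prompt: judge the near-miss answer without feedback.
        prompts.append(AugmentedPrompt(
            kind="verification",
            prompt=(f"Problem:\n{problem}\n\nCandidate answer:\n{step.answer}\n\n"
                    "Is this answer correct?"),
            expected_verdict="incorrect",
        ))
    return prompts
```

Because each derived prompt is single-turn, it could in principle be sampled and optimized without replaying the full multi-step trajectory, which is one plausible reading of how the framework reduces long-horizon sampling cost. How the actual system builds prompts and assigns rewards is specified in the paper and repository, not here.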