REVES: Revision- and Verification-Enhanced Training for Test-Time Scaling

Abstract

Test-time scaling via sequential revision has emerged as a powerful paradigm for enhancing Large Language Model (LLM) reasoning. However, standard post-training methods primarily optimize single-shot objectives, which creates a fundamental misalignment with multi-step inference dynamics. While recent work treats this as multi-turn reinforcement learning (RL), conventional approaches optimize over the multi-step trajectories directly and fail to exploit the high-quality mistakes in intermediate steps, which the model can learn from by correcting them. We propose a two-stage iterative framework that alternates between online data/prompt augmentation and policy optimization. By converting the intermediate steps ("near-miss" answers) of successful recovery trajectories into decoupled revision and verification prompts, our approach concentrates training on both effective answer transformation and error identification. It enables efficient off-policy data generation and reduces the computational overhead of long-horizon sampling compared to standard multi-turn RL. On LiveCodeBench, using publicly available test cases as feedback, we observe gains of +6.5 points over the RL baseline and +4.0 points over standard multi-turn training. Beyond coding, our approach matches the previously reported state-of-the-art (SOTA) result on circle packing while using the smallest base model (4B) and far fewer rollouts than much larger evolutionary search systems. Math results under ground-truth verification further confirm improved correction ability. The approach also generalizes to out-of-distribution constraint-satisfaction puzzles such as n_queens and mini_sudoku, where correctness is defined entirely by the problem constraints. Code is available at https://github.com/yxliu02/REVES.git.
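
The augmentation step described above can be illustrated with a minimal sketch. It is not taken from the released code: the `Step` record, the `augment` function, the prompt wording, and the choice to use only the failed steps before the final answer are all assumptions made for illustration. It shows one plausible way to turn a successful recovery trajectory into decoupled single-turn revision prompts and labeled verification examples.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    """One turn of a multi-step trajectory (hypothetical structure)."""
    answer: str      # candidate answer produced at this turn
    feedback: str    # feedback shown to the model, e.g. public test output
    correct: bool    # outcome of the verifier for this answer


def augment(problem: str, trajectory: List[Step]) -> Tuple[List[str], List[Tuple[str, bool]]]:
    """Convert one trajectory into revision prompts and verification examples.

    Assumption: only trajectories whose final answer is correct are used,
    and only their failed intermediate steps are treated as near-misses.
    """
    revision_prompts: List[str] = []
    verification_examples: List[Tuple[str, bool]] = []

    # Skip empty trajectories and trajectories that never recover.
    if not trajectory or not trajectory[-1].correct:
        return revision_prompts, verification_examples

    for step in trajectory[:-1]:
        if step.correct:
            continue  # not a near-miss; nothing to revise
        # Revision prompt: start directly from the near-miss and its feedback.
        revision_prompts.append(
            f"Problem:\n{problem}\n\n"
            f"Previous answer:\n{step.answer}\n\n"
            f"Feedback:\n{step.feedback}\n\n"
            "Revise the previous answer."
        )
        # Verification example: judge the answer alone; the label is the verifier outcome.
        verification_examples.append(
            (
                f"Problem:\n{problem}\n\n"
                f"Candidate answer:\n{step.answer}\n\n"
                "Is this answer correct?",
                step.correct,
            )
        )
    return revision_prompts, verification_examples
```

Because each generated prompt starts from a stored near-miss rather than from the beginning of the trajectory, the policy can be sampled for a single turn per prompt, which is one reading of how long-horizon sampling cost could be reduced.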