REVES: Revise and Verify, Enhanced Training for Test-Time Scaling

Abstract

Test-time scaling via sequential revision has emerged as a powerful paradigm for enhancing Large Language Model (LLM) reasoning. However, standard post-training methods primarily optimize single-shot objectives, creating a fundamental misalignment with multi-step inference dynamics. While recent work treats this as multi-turn reinforcement learning (RL), conventional approaches optimize over the multi-step trajectories directly and fail to exploit the high-quality mistakes in intermediate steps, which the model can learn from by correcting them. We propose a two-stage iterative framework that alternates between online data/prompt augmentation and policy optimization. By converting the intermediate steps ("near-miss" answers) of successful recovery trajectories into decoupled revision and verification prompts, our approach concentrates training on both effective answer transformation and error identification. Compared to standard multi-turn RL, it enables efficient off-policy data generation and reduces the computational overhead of long-horizon sampling. On LiveCodeBench, using publicly available test cases as feedback, we observe gains of +6.5 points over the RL baseline and +4.0 points over standard multi-turn training. Beyond coding, our approach matches the previously reported SOTA result on circle packing while using the smallest base model (4B) and far fewer rollouts than the much larger evolutionary search systems. Math results under ground-truth verification further confirm improved correction ability. The approach also generalizes to out-of-distribution constraint-satisfaction puzzles such as n_queens and mini_sudoku, where correctness is defined entirely by problem constraints. Code is available at https://github.com/yxliu02/REVES.git.
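The augmentation step described above can be illustrated with a minimal sketch. This is not the released implementation: the `Attempt` record, the `augment` function, the prompt wording, and the `"incorrect"` label are all hypothetical, chosen only to show how the failed intermediate steps of a trajectory that eventually succeeds might be split into separate single-turn revision and verification prompts.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Attempt:
    """One step of a multi-step trajectory (hypothetical record layout)."""
    answer: str    # candidate solution text
    feedback: str  # checker output, e.g. failing public test cases
    passed: bool   # whether the checker accepted this answer


def augment(problem: str, trajectory: List[Attempt]) -> List[Dict[str, str]]:
    """Convert one trajectory into decoupled single-turn prompts.

    Assumption: a "successful recovery trajectory" is one whose final
    attempt passes; every earlier failed attempt counts as a near-miss.
    """
    if not trajectory or not trajectory[-1].passed:
        return []  # trajectories that never recover are not used here

    prompts: List[Dict[str, str]] = []
    for step in trajectory[:-1]:
        if step.passed:
            continue  # only failed intermediate answers are near-misses
        # Revision prompt: given a flawed answer and feedback, fix it.
        prompts.append({
            "kind": "revision",
            "prompt": (
                f"Problem:\n{problem}\n\nPrevious answer:\n{step.answer}\n\n"
                f"Feedback:\n{step.feedback}\n\nRevise the answer."
            ),
        })
        # Verification prompt: judge the same answer without the feedback.
        prompts.append({
            "kind": "verification",
            "prompt": (
                f"Problem:\n{problem}\n\nCandidate answer:\n{step.answer}\n\n"
                "Is this answer correct? Explain any error."
            ),
            "label": "incorrect",  # hypothetical supervision target
        })
    return prompts
```

Under these assumptions, a three-step trajectory with two failures followed by a success yields four single-turn prompts. Each can be sampled independently during policy optimization, so a full multi-turn trajectory does not have to be rolled out for every update.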