
JudgeRLVR: Judge First, Generate Second for Efficient Reasoning

January 13, 2026
Authors: Jiangshan Duo, Hanyu Li, Hailin Zhang, Yudong Wang, Sujian Li, Liang Zhao
cs.AI

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for reasoning in large language models. However, optimizing solely for final-answer correctness often drives models into aimless, verbose exploration, where they rely on exhaustive trial-and-error tactics rather than structured planning to reach solutions. While heuristic constraints such as length penalties can reduce verbosity, they often truncate essential reasoning steps, creating a difficult trade-off between efficiency and verification. In this paper, we argue that discriminative capability is a prerequisite for efficient generation: by learning to distinguish valid solutions, a model can internalize a guidance signal that prunes the search space. We propose JudgeRLVR, a two-stage judge-then-generate paradigm. In the first stage, we train the model to judge candidate solutions to problems with verifiable answers. In the second stage, we initialize the model from the judge and fine-tune it with vanilla generative RLVR. Compared to vanilla RLVR using the same math-domain training data, JudgeRLVR achieves a better quality-efficiency trade-off for Qwen3-30B-A3B: on in-domain math, it delivers an average accuracy gain of about +3.7 points while reducing average generation length by 42%; on out-of-domain benchmarks, it delivers an average accuracy improvement of about +4.5 points, demonstrating enhanced generalization.
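
To make the two-stage setup concrete, below is a minimal sketch of how the rewards in the two stages could be defined, assuming a simple binary-reward formulation. The function names, verdict format, and exact-match answer check are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the two JudgeRLVR reward stages (illustrative only;
# function names, verdict parsing, and answer matching are assumptions).

def judge_stage_reward(model_verdict: str, solution_is_correct: bool) -> float:
    """Stage 1: the model is trained to judge a candidate solution.

    Because each training problem has a verifiable answer, the candidate's
    true correctness label is known, so the judge's verdict can be scored
    directly against it.
    """
    predicted_correct = model_verdict.strip().lower().startswith("correct")
    return 1.0 if predicted_correct == solution_is_correct else 0.0


def generate_stage_reward(model_answer: str, reference_answer: str) -> float:
    """Stage 2: vanilla generative RLVR, initialized from the judge checkpoint.

    The reward is the standard verifiable check on the final answer.
    """
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0
```

In this reading, stage 1 optimizes the policy with judge_stage_reward on labeled candidate solutions, and stage 2 restarts RLVR from that checkpoint with generate_stage_reward, so the discriminative signal learned first shapes the later generative search.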