Judging What We Cannot Solve: A Consequence-Based Approach to Oracle-Free Evaluation of Research-Level Math

Abstract

Recent progress in reasoning models suggests that generating plausible attempts at research-level mathematics may be within reach, but verification remains a bottleneck that consumes scarce expert time. We hypothesize that a meaningful solution contains enough method-level information that, when applied to a neighborhood of related questions, it yields better downstream performance than an incorrect solution does. Building on this idea, we propose Consequence-Based Utility, an oracle-free evaluator that requires no human annotation and scores each candidate by testing its value as an in-context exemplar for solving related yet verifiable questions. We evaluate the approach on an original set of research-level math problems, each paired with one expert-written solution and nine LLM-generated solutions. Consequence-Based Utility consistently outperforms reward models, generative reward models, and LLM judges on ranking quality. Specifically, for GPT-OSS-120B, it improves Acc@1 from 67.2 to 76.3 and AUC from 71.4 to 79.6, with a similarly large AUC gain on GPT-OSS-20B (69.0 to 79.2). Furthermore, compared to LLM judges, it exhibits a larger solver-evaluator gap, maintaining stronger separation between correct and wrong solutions even on instances that the underlying solver often fails to solve.
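The scoring idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `consequence_utility`, `rank_candidates`, `acc_at_1`, `auc`, and `toy_solve` are hypothetical, exact string match stands in for whatever verifier the related questions admit, and the pairwise AUC is one common way to measure ranking quality. In practice `solve` would wrap an LLM call that receives the candidate solution as an in-context exemplar.

```python
from typing import Callable, Sequence

# Hypothetical types: a neighbor is (related_question, verifiable_answer);
# a solver answers a question given one candidate solution as context.
Neighbor = tuple[str, str]
Solver = Callable[[str, str], str]


def consequence_utility(candidate: str, neighbors: Sequence[Neighbor],
                        solve: Solver, n_samples: int = 4) -> float:
    """Fraction of sampled attempts on related questions that verify."""
    if not neighbors or n_samples < 1:
        raise ValueError("need at least one neighbor and one sample")
    hits = 0
    for question, reference in neighbors:
        for _ in range(n_samples):
            # Assumption: exact match plays the role of the verifier.
            if solve(candidate, question).strip() == reference.strip():
                hits += 1
    return hits / (len(neighbors) * n_samples)


def rank_candidates(candidates: Sequence[str], neighbors: Sequence[Neighbor],
                    solve: Solver, n_samples: int = 4) -> list[int]:
    """Indices of candidates sorted from highest to lowest utility."""
    scores = [consequence_utility(c, neighbors, solve, n_samples)
              for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


def acc_at_1(ranking: Sequence[int], labels: Sequence[bool]) -> float:
    """1.0 if the top-ranked candidate is labeled correct, else 0.0."""
    return 1.0 if labels[ranking[0]] else 0.0


def auc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Pairwise AUC: chance that a correct candidate outscores a wrong one."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one wrong candidate")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def toy_solve(exemplar: str, question: str) -> str:
    """Toy stand-in for an LLM: succeeds only if the exemplar says to add."""
    a, b = question.split("+")
    return str(int(a) + int(b)) if "add" in exemplar else "0"


neighbors = [("2+3", "5"), ("1+1", "2")]
candidates = ["multiply the terms", "add the terms"]
ranking = rank_candidates(candidates, neighbors, toy_solve)
```

With the toy solver, the candidate that carries the useful method ranks first because it is the only one that helps on the related questions. No ground-truth label for the original problem is consulted during scoring; labels enter only when computing `acc_at_1` and `auc` for evaluation.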