Judging What We Cannot Solve: A Consequence-Based Approach for Oracle-Free Evaluation of Research-Level Math

Abstract

Recent progress in reasoning models suggests that generating plausible attempts at research-level mathematics may be within reach, but verification remains a bottleneck that consumes scarce expert time. We hypothesize that a meaningful solution contains enough method-level information that, when applied to a neighborhood of related questions, it yields better downstream performance than an incorrect solution does. Building on this idea, we propose Consequence-Based Utility, an oracle-free evaluator that scores each candidate by testing its value as an in-context exemplar for solving related yet verifiable questions. We evaluate our approach on an original set of research-level math problems, each paired with one expert-written solution and nine LLM-generated solutions. Notably, Consequence-Based Utility consistently outperforms reward models, generative reward models, and LLM judges on ranking quality. Specifically, for GPT-OSS-120B, it improves Acc@1 from 67.2 to 76.3 and AUC from 71.4 to 79.6, with similarly large AUC gains on GPT-OSS-20B (69.0 to 79.2). Furthermore, compared to LLM judges, it exhibits a larger solver-evaluator gap, maintaining stronger correct-wrong separation even on instances where the underlying solver often fails.
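
The scoring idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `Solver` interface, the exact-match answer check, the `samples` parameter, and the specific definitions of Acc@1 and pairwise AUC used here are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical types (not from the paper):
# a related question is a (prompt, verifiable_answer) pair, and a solver
# maps (exemplar_solution, question_prompt) to an answer string.
Question = Tuple[str, str]
Solver = Callable[[str, str], str]


def consequence_utility(candidate: str,
                        neighborhood: Sequence[Question],
                        solver: Solver,
                        samples: int = 1) -> float:
    """Score a candidate solution by how often the solver answers related,
    verifiable questions correctly when the candidate is shown as an
    in-context exemplar. Exact string match is an assumed verifier."""
    if not neighborhood or samples < 1:
        raise ValueError("need at least one related question and one sample")
    hits = 0
    for prompt, answer in neighborhood:
        for _ in range(samples):
            if solver(candidate, prompt).strip() == answer.strip():
                hits += 1
    return hits / (len(neighborhood) * samples)


def acc_at_1(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Assumed definition: 1.0 if the top-scored candidate is correct."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return 1.0 if labels[best] else 0.0


def pairwise_auc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Assumed definition: probability that a correct candidate outscores
    an incorrect one, with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect candidate")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In use, each of the ten candidates for a problem would be scored with `consequence_utility`, and the resulting scores ranked against the known correctness labels with `acc_at_1` and `pairwise_auc`. In practice the `solver` would wrap an LLM call, which is left abstract here.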
