Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving
December 11, 2025
Authors: Songyang Gao, Yuzhe Gu, Zijian Wu, Lingkai Kong, Wenwei Zhang, Zhongrui Cai, Fan Zheng, Tianyou Ma, Junhao Shen, Haiteng Zhao, Duanyang Zhang, Huilun Zhang, Kuikun Liu, Chengqi Lyu, Yanhui Duan, Chiyu Chen, Ningsheng Ma, Jianfei Gao, Han Lyu, Dahua Lin, Kai Chen
cs.AI
Abstract
Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks through Reinforcement Learning with Verifiable Rewards (RLVR). This progress is also inseparable from the automated oversight provided by reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) struggle to reliably detect errors in complex long CoTs, limited by the scarcity of high-quality annotations caused by the prohibitive cost of human labeling. Therefore, we propose the Outcome-based Process Verifier (OPV), which verifies the rationale behind the summarized outcomes of long CoTs, achieving both accurate and efficient verification and enabling large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve OPV's verification capability at a lower annotation cost. Specifically, in each iteration, the cases about which the current best OPV is most uncertain are annotated by experts and subsequently used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV's superior performance and broad applicability. It achieves new state-of-the-art results on our held-out \thisbench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3. Furthermore, OPV effectively detects false positives within synthetic datasets, aligning closely with expert assessments. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B on AIME2025 from 55.2% to 73.3% as the compute budget scales.
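To make the training loop concrete, the following is a minimal, hypothetical Python sketch of one round of the iterative active-learning procedure described in the abstract: score unlabeled cases by the current OPV's uncertainty, send the most uncertain ones to experts, then train the next-generation verifier via RFT followed by RLVR. All names here (sample_verdicts, expert_annotate, rejection_finetune, train_rlvr, budget, k_samples) are illustrative assumptions, not the authors' actual interface or hyperparameters, and verdict disagreement is used only as a stand-in for the unspecified uncertainty measure.

# Hypothetical sketch of one round of the iterative active-learning loop
# (uncertainty-based selection -> expert annotation -> RFT + RLVR training).
# Function and parameter names are placeholders, not the paper's actual API.

from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Case:
    cot: str                     # long chain-of-thought to be verified
    outcome: str                 # summarized outcome whose rationale OPV checks
    label: Optional[int] = None  # 1 = sound rationale, 0 = flawed (expert-provided)


def uncertainty(verdicts: List[int]) -> float:
    """Disagreement among repeated sampled verdicts; higher means more uncertain."""
    p = sum(verdicts) / len(verdicts)
    return 1.0 - abs(p - 0.5) * 2.0  # in [0, 1]; 1.0 at a 50/50 split


def active_learning_round(
    verifier,                      # current best OPV checkpoint
    unlabeled_pool: List[Case],
    sample_verdicts: Callable,     # (verifier, case, k) -> list of k binary verdicts
    expert_annotate: Callable,     # (cases) -> cases with .label filled in
    rejection_finetune: Callable,  # (verifier, labeled) -> new checkpoint (RFT)
    train_rlvr: Callable,          # (verifier, labeled) -> new checkpoint (RLVR)
    budget: int = 512,             # number of expert annotations per round (assumed)
    k_samples: int = 8,            # verdict samples per case (assumed)
):
    # 1) Score each unlabeled case by the verifier's self-disagreement.
    scored = [
        (uncertainty(sample_verdicts(verifier, c, k_samples)), c)
        for c in unlabeled_pool
    ]

    # 2) Route the most uncertain cases to human experts, up to the budget.
    scored.sort(key=lambda t: t[0], reverse=True)
    newly_labeled = expert_annotate([c for _, c in scored[:budget]])

    # 3) Train the next-generation OPV on the expert labels: RFT, then RLVR.
    verifier = rejection_finetune(verifier, newly_labeled)
    verifier = train_rlvr(verifier, newly_labeled)
    return verifier, newly_labeled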