Reasoning over mathematical objects: on-policy reward modeling and test time aggregation

March 19, 2026
作者: Pranjal Aggarwal, Marjan Ghazvininejad, Seungone Kim, Ilia Kulikov, Jack Lanchantin, Xian Li, Tianjian Li, Bo Liu, Graham Neubig, Anaelia Ovalle, Swarnadeep Saha, Sainbayar Sukhbaatar, Sean Welleck, Jason Weston, Chenxi Whitehouse, Adina Williams, Jing Xu, Ping Yu, Weizhe Yuan, Jingyu Zhang, Wenting Zhao
cs.AI

Abstract

The ability to precisely derive mathematical objects is a core requirement for downstream STEM applications, including mathematics, physics, and chemistry, where reasoning must culminate in formally structured expressions. Yet current LM evaluations of mathematical and scientific reasoning rely heavily on simplified answer formats, such as numerical values or multiple-choice options, due to the convenience of automated assessment. In this paper, we make three contributions to improving reasoning over mathematical objects: (i) we build and release training data and benchmarks for deriving mathematical objects, the Principia suite; (ii) we provide training recipes with strong LLM judges and verifiers, showing that on-policy judge training boosts performance; and (iii) we show how on-policy training can also be used to scale test-time compute via aggregation. We find that strong LMs such as Qwen3-235B and o3 struggle on Principia, whereas our training recipes bring significant improvements across different LLM backbones while simultaneously improving results on existing numerical and MCQA tasks, demonstrating cross-format generalization of reasoning abilities.
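The test-time aggregation in (iii) differs from standard numerical majority voting because the answers are free-form mathematical objects, where equivalent derivations may surface in different algebraic forms. The sketch below is an illustration of that idea only, not the paper's recipe: sample several candidate answers, cluster them by symbolic equivalence, and return a representative of the largest cluster. The function name `aggregate_candidates` and the SymPy-based equivalence check are assumptions introduced for this example.

```python
# Minimal sketch (not the paper's implementation): test-time aggregation over
# candidate mathematical objects via symbolic majority voting. Assumes each
# candidate's final answer is a SymPy-parsable expression; `candidates` would
# come from sampling a model several times on the same problem.
from collections import defaultdict

import sympy as sp


def aggregate_candidates(candidates: list[str]) -> str:
    """Group symbolically equivalent expressions and return one from the largest group."""
    clusters: dict[int, list[str]] = defaultdict(list)
    representatives: list[sp.Expr] = []

    for text in candidates:
        try:
            expr = sp.sympify(text)
        except (sp.SympifyError, SyntaxError):
            continue  # skip candidates that do not parse as expressions
        # Attach to an existing cluster whose representative is symbolically equal.
        for idx, rep in enumerate(representatives):
            if sp.simplify(rep - expr) == 0:
                clusters[idx].append(text)
                break
        else:
            representatives.append(expr)
            clusters[len(representatives) - 1].append(text)

    if not clusters:
        raise ValueError("no parsable candidates to aggregate")
    # Majority vote: return a member of the largest equivalence class.
    return max(clusters.values(), key=len)[0]


# Example: three samples agree up to algebraic form, one is an outlier.
print(aggregate_candidates(["(x+1)**2", "x**2 + 2*x + 1", "x**2+2*x+1", "x**2"]))
```

A verifier or trained judge could replace the `sp.simplify` equality test when answers are not plain closed-form expressions; the clustering-and-vote structure stays the same.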