Reasoning over mathematical objects: on-policy reward modeling and test-time aggregation
March 19, 2026
Authors: Pranjal Aggarwal, Marjan Ghazvininejad, Seungone Kim, Ilia Kulikov, Jack Lanchantin, Xian Li, Tianjian Li, Bo Liu, Graham Neubig, Anaelia Ovalle, Swarnadeep Saha, Sainbayar Sukhbaatar, Sean Welleck, Jason Weston, Chenxi Whitehouse, Adina Williams, Jing Xu, Ping Yu, Weizhe Yuan, Jingyu Zhang, Wenting Zhao
cs.AI
Abstract
The ability to precisely derive mathematical objects is a core requirement for downstream STEM applications, including mathematics, physics, and chemistry, where reasoning must culminate in formally structured expressions. Yet current LM evaluations of mathematical and scientific reasoning rely heavily on simplified answer formats, such as numerical values or multiple-choice options, due to the convenience of automated assessment. In this paper, we provide three contributions for improving reasoning over mathematical objects: (i) we build and release training data and benchmarks for deriving mathematical objects, the Principia suite; (ii) we provide training recipes with strong LLM judges and verifiers, where we show that on-policy judge training boosts performance; (iii) we show how on-policy training can also be used to scale test-time compute via aggregation. We find that strong LMs such as Qwen3-235B and o3 struggle on Principia, whereas our training recipes bring significant improvements across different LLM backbones while simultaneously improving results on existing numerical and MCQA tasks, demonstrating cross-format generalization of reasoning abilities.
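To make the aggregation idea concrete, below is a minimal sketch of one common form of test-time aggregation over sampled derivations: majority voting across symbolically equivalent final expressions. This is our own illustration under stated assumptions, not the paper's implementation; the function name aggregate_by_equivalence and the use of sympy for the equivalence check are assumptions introduced here.

```python
# Minimal sketch of test-time aggregation: group sampled final expressions
# into symbolic-equivalence classes and return a representative of the
# largest class. NOT the paper's method; names here are hypothetical.
from collections import defaultdict
import sympy

def aggregate_by_equivalence(candidate_exprs: list[str]) -> str:
    """Majority-vote over candidate final expressions, where candidates
    that simplify to the same canonical form vote together."""
    buckets: dict = defaultdict(list)
    for raw in candidate_exprs:
        try:
            expr = sympy.simplify(sympy.sympify(raw))
        except (sympy.SympifyError, TypeError):
            continue  # skip candidates that fail to parse
        # srepr gives a canonical string key, so equivalent forms
        # (e.g. "(x-1)*(x+1)" and "x**2 - 1") land in the same bucket.
        buckets[sympy.srepr(expr)].append(raw)
    if not buckets:
        raise ValueError("no parseable candidates")
    # Return a representative of the largest equivalence class.
    return max(buckets.values(), key=len)[0]

# Example: three sampled answers, two of which are equivalent.
print(aggregate_by_equivalence(["x**2 - 1", "(x-1)*(x+1)", "x**2 + 1"]))
# -> "x**2 - 1" (the majority equivalence class)
```

Voting over equivalence classes rather than raw strings matters for mathematical objects, since correct derivations can surface the same expression in many syntactic forms.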