Reasoning over mathematical objects: on-policy reward modeling and test time aggregation

Abstract

The ability to precisely derive mathematical objects is a core requirement for downstream STEM applications, including mathematics, physics, and chemistry, where reasoning must culminate in formally structured expressions. Yet current LM evaluations of mathematical and scientific reasoning rely heavily on simplified answer formats, such as numerical values or multiple-choice options, because these are convenient to assess automatically. In this paper, we make three contributions toward improving reasoning over mathematical objects: (i) we build and release training data and benchmarks for deriving mathematical objects, the Principia suite; (ii) we provide training recipes with strong LLM judges and verifiers, and show that on-policy judge training boosts performance; (iii) we show how on-policy training can also be used to scale test-time compute via aggregation. We find that strong LMs such as Qwen3-235B and o3 struggle on Principia, whereas our training recipes bring significant improvements across different LLM backbones and simultaneously improve results on existing numerical and multiple-choice question answering (MCQA) tasks, demonstrating cross-format generalization of reasoning abilities.
