Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning

Abstract

Scaling pre-training compute has proven effective for achieving multilinguality, but does the same hold for test-time scaling? In this work, we introduce MCLM, a multilingual math benchmark featuring competition-level problems in 55 languages. We test three test-time scaling methods, Outcome Reward Modeling (ORM), Process Reward Modeling (PRM), and Budget Forcing (BF), on both Qwen2.5-1.5B Math and MR1-1.5B, a multilingual LLM we trained for extended reasoning. Our experiments show that Qwen2.5-1.5B Math with ORM achieves a score of 35.8 on MCLM, while BF on MR1-1.5B attains 35.2. Although "thinking LLMs" have recently garnered significant attention, we find that their performance is comparable to traditional scaling methods such as best-of-N once constrained to similar levels of inference FLOPs. Moreover, while BF yields a 20-point improvement on English AIME, it provides only a 1.94-point average gain across other languages. This pattern is consistent across the other test-time scaling methods we studied, highlighting that test-time scaling may not generalize as effectively to multilingual tasks. To foster further research, we release MCLM, MR1-1.5B, and evaluation results.
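For readers unfamiliar with best-of-N selection under an outcome reward model, the following is a minimal sketch of the general technique, not the implementation used in this work. The names `best_of_n`, `make_toy_generator`, and `toy_score` are hypothetical, and the toy generator and scorer are stand-ins for a real sampling LLM and a trained reward model.

```python
from typing import Callable, Iterable, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 8,
) -> Tuple[str, float]:
    """Sample n candidate solutions and return the highest-scoring one.

    `generate` stands in for a sampling LLM and `score` for an outcome
    reward model that rates a complete solution; both are assumptions.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    scored = [(c, score(prompt, c)) for c in candidates]
    # max() returns the first candidate with the top score on ties.
    return max(scored, key=lambda pair: pair[1])


def make_toy_generator(answers: Iterable[str]) -> Callable[[str], str]:
    """Hypothetical generator that replays a fixed list of answers."""
    it = iter(answers)

    def generate(prompt: str) -> str:
        return next(it)

    return generate


def toy_score(prompt: str, candidate: str) -> float:
    """Hypothetical reward: 1.0 if the answer ends in '4', else 0.0."""
    return 1.0 if candidate.strip().endswith("4") else 0.0


if __name__ == "__main__":
    gen = make_toy_generator(["2 + 2 = 5", "2 + 2 = 4", "2 + 2 = 3"])
    best, reward = best_of_n("What is 2 + 2?", gen, toy_score, n=3)
    print(best, reward)
```

In this framing, inference cost grows roughly linearly with `n`, which is why a comparison against budget forcing is most meaningful at matched inference FLOPs.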
