Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning
February 24, 2025
Authors: Guijin Son, Jiwoo Hong, Hyunwoo Ko, James Thorne
cs.AI
Abstract
Scaling pre-training compute has proven effective for achieving
multilinguality, but does the same hold for test-time scaling? In this work, we
introduce MCLM, a multilingual math benchmark featuring competition-level
problems in 55 languages. We test three test-time scaling methods, Outcome
Reward Modeling (ORM), Process Reward Modeling (PRM), and Budget Forcing
(BF), on both Qwen2.5-1.5B Math and MR1-1.5B, a multilingual LLM we trained for
extended reasoning. Our experiments show that using Qwen2.5-1.5B Math with ORM
achieves a score of 35.8 on MCLM, while BF on MR1-1.5B attains 35.2. Although
"thinking LLMs" have recently garnered significant attention, we find that
their performance is comparable to traditional scaling methods like best-of-N
once constrained to similar levels of inference FLOPs. Moreover, while BF
yields a 20-point improvement on English AIME, it provides only a 1.94-point
average gain across other languages, a pattern consistent across the other
test-time scaling methods we studied, highlighting that test-time scaling may
not generalize as effectively to multilingual tasks. To foster further
research, we release MCLM, MR1-1.5B, and evaluation results.
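
For readers unfamiliar with the best-of-N baseline the abstract compares against, here is a minimal sketch of reward-guided best-of-N sampling. The `generate` and `score` callables are illustrative assumptions standing in for the policy LLM and the outcome reward model; this is not the paper's actual implementation.

```python
# Minimal sketch of best-of-N sampling with an outcome reward model (ORM).
# `generate` and `score` are assumed callables, not the paper's code:
#   generate(problem) -> str    : stochastic decoder (temperature > 0)
#   score(problem, sol) -> float: ORM judging the final answer
def best_of_n(problem: str, generate, score, n: int = 16) -> str:
    """Sample n candidate solutions and return the highest-scoring one."""
    candidates = [generate(problem) for _ in range(n)]
    rewards = [score(problem, c) for c in candidates]
    best_idx = max(range(n), key=lambda i: rewards[i])
    return candidates[best_idx]
```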
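Budget Forcing (BF) extends or truncates a thinking LLM's chain of thought to hit a target token budget. The sketch below follows the commonly described recipe of suppressing early stops with a continuation cue and forcing the end-of-thinking delimiter at the budget; the `step` callable, the `</think>` delimiter, and the "Wait" cue are assumptions for illustration, not the paper's exact setup.

```python
# Minimal sketch of budget forcing (BF), under the assumptions above.
#   step(tokens) -> str : returns the next token given the context.
def budget_forced_decode(prompt_tokens, step, stop_token="</think>",
                         min_tokens=512, max_tokens=4096):
    """Decode while (a) suppressing early stops until min_tokens is
    reached and (b) forcing the end-of-thinking delimiter once
    max_tokens is exceeded."""
    tokens = list(prompt_tokens)
    while True:
        nxt = step(tokens)
        if nxt == stop_token and len(tokens) < min_tokens:
            tokens.append("Wait")  # cue the model to keep reasoning
            continue
        tokens.append(nxt)
        if nxt == stop_token:
            break
        if len(tokens) >= max_tokens:
            tokens.append(stop_token)  # cut off thinking at the budget
            break
    return tokens
```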