Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning
February 24, 2025
Authors: Guijin Son, Jiwoo Hong, Hyunwoo Ko, James Thorne
cs.AI
Abstract
Scaling pre-training compute has proven effective for achieving
multilinguality, but does the same hold for test-time scaling? In this work, we
introduce MCLM, a multilingual math benchmark featuring competition-level
problems in 55 languages. We test three test-time scaling methods, Outcome
Reward Modeling (ORM), Process Reward Modeling (PRM), and Budget Forcing
(BF), on both Qwen2.5-1.5B Math and MR1-1.5B, a multilingual LLM we trained for
extended reasoning. Our experiments show that using Qwen2.5-1.5B Math with ORM
achieves a score of 35.8 on MCLM, while BF on MR1-1.5B attains 35.2. Although
"thinking LLMs" have recently garnered significant attention, we find that
their performance is comparable to traditional scaling methods like best-of-N
once constrained to similar levels of inference FLOPs. Moreover, while BF
yields a 20-point improvement on English AIME, it provides only a 1.94-point
average gain across other languages, a pattern consistent across the other
test-time scaling methods we studied, highlighting that test-time scaling may
not generalize as effectively to multilingual tasks. To foster further
research, we release MCLM, MR1-1.5B, and evaluation results.
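
For readers unfamiliar with the best-of-N baseline the abstract compares against, here is a minimal sketch of reward-guided best-of-N sampling. The `generate` and `score` callables are illustrative assumptions standing in for the policy LLM and the outcome reward model; this is not the paper's actual implementation.

```python
# Minimal sketch of best-of-N sampling with an outcome reward model (ORM).
# `generate` and `score` are assumed callables, not the paper's code:
#   generate(problem) -> str    : stochastic decoder (temperature > 0)
#   score(problem, sol) -> float: ORM judging the final answer
def best_of_n(problem: str, generate, score, n: int = 16) -> str:
    """Sample n candidate solutions and return the highest-scoring one."""
    candidates = [generate(problem) for _ in range(n)]
    rewards = [score(problem, c) for c in candidates]
    best_idx = max(range(n), key=lambda i: rewards[i])
    return candidates[best_idx]
```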
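Budget Forcing (BF) extends or truncates a thinking LLM's chain of thought to hit a target token budget. The sketch below follows the commonly described recipe of suppressing early stops with a continuation cue and forcing the end-of-thinking delimiter at the budget; the `step` callable, the `</think>` delimiter, and the "Wait" cue are assumptions for illustration, not the paper's exact setup.

```python
# Minimal sketch of budget forcing (BF), under the assumptions above.
#   step(tokens) -> str : returns the next token given the context.
def budget_forced_decode(prompt_tokens, step, stop_token="</think>",
                         min_tokens=512, max_tokens=4096):
    """Decode while (a) suppressing early stops until min_tokens is
    reached and (b) forcing the end-of-thinking delimiter once
    max_tokens is exceeded."""
    tokens = list(prompt_tokens)
    while True:
        nxt = step(tokens)
        if nxt == stop_token and len(tokens) < min_tokens:
            tokens.append("Wait")  # cue the model to keep reasoning
            continue
        tokens.append(nxt)
        if nxt == stop_token:
            break
        if len(tokens) >= max_tokens:
            tokens.append(stop_token)  # cut off thinking at the budget
            break
    return tokens
```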