Abstract: The ability of large language models (LLMs) to generalize across languages is a critical aspect of their performance on mathematical reasoning tasks. This paper investigates the linguistic generalizability of test-time scaling, which spends additional compute at inference time (for example, by sampling and ranking several candidate solutions, or by extending the reasoning trace) rather than changing model parameters. We conduct experiments across multiple languages and mathematical reasoning benchmarks to evaluate how well test-time scaling works in diverse linguistic settings. Our results indicate that test-time scaling substantially improves performance in English, while its gains are much smaller in other languages. These findings highlight the importance of linguistic diversity in the development and evaluation of LLMs for mathematical reasoning, and suggest that future work should treat cross-lingual robustness as a key evaluation criterion.

Summary

Scaling pre-training compute has proven effective for achieving multilinguality, but does the same hold for test-time scaling? In this work, we introduce MCLM, a multilingual math benchmark featuring competition-level problems in 55 languages. We test three test-time scaling methods, Outcome Reward Modeling (ORM), Process Reward Modeling (PRM), and Budget Forcing (BF), on both Qwen2.5-1.5B Math and MR1-1.5B, a multilingual LLM we trained for extended reasoning. Our experiments show that Qwen2.5-1.5B Math with ORM achieves a score of 35.8 on MCLM, while BF on MR1-1.5B attains 35.2. Although "thinking LLMs" have recently garnered significant attention, we find that their performance is comparable to traditional scaling methods such as best-of-N once constrained to similar levels of inference FLOPs. Moreover, while BF yields a 20-point improvement on English AIME, it provides only a 1.94-point average gain across other languages. This pattern is consistent across the other test-time scaling methods we studied, highlighting that test-time scaling may not generalize as effectively to multilingual tasks. To foster further research, we release MCLM, MR1-1.5B, and evaluation results.
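To make the best-of-N baseline mentioned above concrete, the following is a minimal sketch of best-of-N selection with an outcome reward model. It is an illustration under stated assumptions, not the authors' implementation: `best_of_n`, `make_toy_generator`, and `toy_score` are hypothetical names, and the toy generator and scorer stand in for a real LLM sampler and a trained ORM.

```python
from typing import Callable, List, Tuple


def best_of_n(
    problem: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 4,
) -> Tuple[str, float]:
    """Sample n candidate solutions and return the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Draw n independent candidates from the (assumed) generator.
    candidates: List[str] = [generate(problem) for _ in range(n)]
    # Score each complete candidate once, as an outcome reward model would.
    scores = [score(problem, c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]


# Hypothetical stand-in for an LLM sampler: replays a fixed list of answers.
def make_toy_generator(answers: List[str]) -> Callable[[str], str]:
    it = iter(answers)
    return lambda problem: next(it)


# Hypothetical stand-in for an ORM: prefers the answer "4" and ignores the problem.
def toy_score(problem: str, candidate: str) -> float:
    return 1.0 if candidate.strip() == "4" else 0.1


if __name__ == "__main__":
    gen = make_toy_generator(["5", "4", "22", "3"])
    print(best_of_n("2 + 2", gen, toy_score, n=4))
```

In this sketch, inference cost grows roughly linearly with `n`, since each candidate needs one generation and one scoring pass. This is why comparisons against long-reasoning models are made at similar levels of inference FLOPs.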

Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning
