Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning

Abstract

Scaling pre-training compute has proven effective for achieving multilinguality, but does the same hold for test-time scaling? In this work, we introduce MCLM, a multilingual math benchmark featuring competition-level problems in 55 languages. We test three test-time scaling methods on both Qwen2.5-1.5B Math and MR1-1.5B, a multilingual LLM we trained for extended reasoning: Outcome Reward Modeling (ORM), Process Reward Modeling (PRM), and Budget Forcing (BF). Our experiments show that Qwen2.5-1.5B Math with ORM achieves a score of 35.8 on MCLM, while BF on MR1-1.5B attains 35.2. Although "thinking LLMs" have recently garnered significant attention, we find that their performance is comparable to traditional scaling methods such as best-of-N once both are constrained to similar levels of inference FLOPs. Moreover, while BF yields a 20-point improvement on English AIME, it provides only a 1.94-point average gain across other languages. This pattern is consistent across the other test-time scaling methods we studied, suggesting that test-time scaling may not generalize as effectively to multilingual tasks. To foster further research, we release MCLM, MR1-1.5B, and evaluation results.
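
As an illustration of the best-of-N baseline mentioned in the abstract, the following minimal sketch selects, from N sampled answers, the one that a scoring function rates highest. It is not the authors' implementation: `best_of_n` and `toy_score` are hypothetical names, and `toy_score` is a toy stand-in for an outcome reward model, which in practice would be a learned model scoring each complete solution.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest score (first one wins on ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


def toy_score(answer: str) -> float:
    # Hypothetical stand-in for an outcome reward model: it rewards
    # answers ending in "42". A real ORM would be a learned scorer.
    return 1.0 if answer.strip().endswith("42") else 0.0


# N = 3 sampled answers to the same problem (made-up strings for illustration).
samples = ["The answer is 41", "The answer is 42", "I am not sure"]
chosen = best_of_n(samples, toy_score)
```

Under this scheme the inference cost grows roughly linearly with N, since every candidate must be generated and scored, which is why comparisons against "thinking" models are made at matched inference FLOPs.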
