Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling

Abstract

Test-Time Scaling (TTS) is an important method for improving the performance of Large Language Models (LLMs) by using additional computation during the inference phase. However, current studies do not systematically analyze how policy models, Process Reward Models (PRMs), and problem difficulty influence TTS. This lack of analysis limits the understanding and practical use of TTS methods. In this paper, we focus on two core questions: (1) What is the optimal approach to scaling test-time computation across different policy models, PRMs, and problem difficulty levels? (2) To what extent can extended computation improve the performance of LLMs on complex tasks, and can smaller language models outperform larger ones through this approach? Through comprehensive experiments on MATH-500 and the challenging AIME24 tasks, we make the following observations: (1) The compute-optimal TTS strategy is highly dependent on the choice of policy model, PRM, and problem difficulty. (2) With our compute-optimal TTS strategy, extremely small policy models can outperform larger models. For example, a 1B LLM can exceed a 405B LLM on MATH-500. Moreover, on both MATH-500 and AIME24, a 0.5B LLM outperforms GPT-4o, a 3B LLM surpasses a 405B LLM, and a 7B LLM beats o1 and DeepSeek-R1, all with higher inference efficiency. These findings show the importance of adapting TTS strategies to the specific characteristics of each task and model, and indicate that TTS is a promising approach for enhancing the reasoning abilities of LLMs.
