Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling
February 10, 2025
Authors: Runze Liu, Junqi Gao, Jian Zhao, Kaiyan Zhang, Xiu Li, Biqing Qi, Wanli Ouyang, Bowen Zhou
cs.AI
Abstract
Test-Time Scaling (TTS) is an important method for improving the performance
of Large Language Models (LLMs) by using additional computation during the
inference phase. However, current studies do not systematically analyze how
policy models, Process Reward Models (PRMs), and problem difficulty influence
TTS. This lack of analysis limits the understanding and practical use of TTS
methods. In this paper, we focus on two core questions: (1) What is the optimal
approach to scale test-time computation across different policy models, PRMs,
and problem difficulty levels? (2) To what extent can extended computation
improve the performance of LLMs on complex tasks, and can smaller language
models outperform larger ones through this approach? Through comprehensive
experiments on MATH-500 and challenging AIME24 tasks, we have the following
observations: (1) The compute-optimal TTS strategy is highly dependent on the
choice of policy model, PRM, and problem difficulty. (2) With our
compute-optimal TTS strategy, extremely small policy models can outperform
larger models. For example, a 1B LLM can exceed a 405B LLM on MATH-500.
Moreover, on both MATH-500 and AIME24, a 0.5B LLM outperforms GPT-4o, a 3B LLM
surpasses a 405B LLM, and a 7B LLM beats o1 and DeepSeek-R1, all with higher
inference efficiency. These findings show the significance of adapting TTS
strategies to the specific characteristics of each task and model and indicate
that TTS is a promising approach for enhancing the reasoning abilities of LLMs.
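
In this line of work, "compute-optimal TTS" typically means choosing, per prompt, the scaling strategy that maximizes expected accuracy under a fixed inference budget. A minimal formalization in that spirit, with notation assumed here rather than taken from the paper:

\theta^{*}_{x}(N) = \arg\max_{\theta} \; \mathbb{E}_{y \sim \mathrm{Target}(\theta, N, x)} \big[ \mathbf{1}\{ y = y^{*}(x) \} \big]

where x is the prompt, N the inference compute budget, \theta the TTS hyperparameters (e.g., search method and width), \mathrm{Target}(\theta, N, x) the output distribution the strategy induces, and y^{*}(x) the correct answer. The abstract's first observation says that this optimum shifts with the policy model, the PRM, and the problem difficulty.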
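The abstract does not spell out the mechanics, but one common instantiation of PRM-guided test-time scaling is best-of-N search: sample N candidate solutions from the policy model, score each solution's steps with the PRM, and return the highest-scoring candidate. The sketch below illustrates that idea only; `policy.generate` and `prm.score_steps` are hypothetical interfaces, not the paper's API, and aggregating step scores by their minimum is one common choice among several.

# Minimal sketch of PRM-guided best-of-N test-time scaling.
# Hypothetical interfaces (assumptions, not the paper's code):
#   policy.generate(problem) -> list[str]           # reasoning steps of one sampled solution
#   prm.score_steps(problem, steps) -> list[float]  # per-step reward in [0, 1]

def best_of_n(policy, prm, problem: str, n: int = 16) -> list[str]:
    """Sample n candidate solutions; keep the one the PRM scores highest."""
    best_steps, best_score = None, float("-inf")
    for _ in range(n):
        steps = policy.generate(problem)              # one sampled chain of steps
        step_scores = prm.score_steps(problem, steps)
        score = min(step_scores)                      # the weakest step bounds the chain
        if score > best_score:
            best_steps, best_score = steps, score
    return best_steps

Under this framing, "scaling test-time compute" means increasing n (or switching to a search method such as beam search), which is how a small policy model can spend extra inference compute to close the gap with a much larger one.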