
Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities?

February 17, 2025
作者: Zhiyuan Zeng, Qinyuan Cheng, Zhangyue Yin, Yunhua Zhou, Xipeng Qiu
cs.AI

Abstract

The advent of test-time scaling in large language models (LLMs), exemplified by OpenAI's o1 series, has advanced reasoning capabilities by scaling computational resource allocation during inference. While successors such as QwQ, Deepseek-R1 (R1), and LIMO replicate these advancements, whether these models truly possess test-time scaling capabilities remains underexplored. This study finds that longer chains of thought (CoTs) of these o1-like models do not consistently enhance accuracy; in fact, correct solutions are often shorter than incorrect ones for the same questions. Further investigation shows that this phenomenon is closely related to models' self-revision capabilities: longer CoTs contain more self-revisions, which often lead to performance degradation. We then compare sequential and parallel scaling strategies on QwQ, R1, and LIMO, finding that parallel scaling achieves better coverage and scalability. Based on these insights, we propose Shortest Majority Vote, a method that combines parallel scaling strategies with CoT length characteristics, significantly improving models' test-time scalability compared to conventional majority voting approaches.
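The abstract does not spell out the exact aggregation rule, but the idea of Shortest Majority Vote can be sketched as follows: sample answers in parallel, group identical answers, and break ties in favor of answers backed by shorter chains of thought. The function name, the `(answer, cot_length)` input format, and the specific tie-breaking rule (vote count first, then shorter average CoT length) are illustrative assumptions, not the paper's exact formulation.

```python
from collections import defaultdict


def shortest_majority_vote(candidates):
    """Select an answer from parallel samples.

    candidates: list of (answer, cot_length) pairs, one per parallel sample.
    Ranks answers by vote count (descending); ties are broken by the
    shorter average CoT length (ascending). This is an assumed rule,
    sketched from the abstract's description.
    """
    groups = defaultdict(list)
    for answer, cot_len in candidates:
        groups[answer].append(cot_len)
    # max over (vote count, negative mean CoT length): more votes win,
    # and among equally-voted answers the shorter reasoning wins.
    best_answer, _ = max(
        groups.items(),
        key=lambda kv: (len(kv[1]), -sum(kv[1]) / len(kv[1])),
    )
    return best_answer


# Example: "42" has three votes, so it beats "41" despite mixed lengths.
samples = [("42", 800), ("42", 650), ("41", 1200), ("42", 900), ("41", 400)]
print(shortest_majority_vote(samples))
```

With a vote tie, the answer whose CoTs are shorter on average is preferred, reflecting the paper's observation that correct solutions tend to be shorter than incorrect ones for the same question.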
