Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities?

Abstract

The advent of test-time scaling in large language models (LLMs), exemplified by OpenAI's o1 series, has advanced reasoning capabilities by scaling the allocation of computational resources during inference. Successors such as QwQ, DeepSeek-R1 (R1), and LIMO replicate these advancements, but whether they truly possess test-time scaling capabilities remains underexplored. We find that longer chains of thought (CoTs) from these o1-like models do not consistently improve accuracy; in fact, for the same question, correct solutions are often shorter than incorrect ones. Further investigation shows that this phenomenon is closely related to the models' self-revision capabilities: longer CoTs contain more self-revisions, which often degrade performance. We then compare sequential and parallel scaling strategies on QwQ, R1, and LIMO, and find that parallel scaling achieves better coverage and scalability. Based on these insights, we propose Shortest Majority Vote, a method that combines parallel scaling with CoT length characteristics and significantly improves models' test-time scalability compared with conventional majority voting.
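To make the idea of combining parallel samples with CoT length concrete, here is a minimal Python sketch. The abstract does not give the scoring rule, so the formula below (vote count divided by the log of the mean CoT length) is purely an illustrative assumption and should not be read as the paper's definition of Shortest Majority Vote. The function names and the `(answer, cot_length)` input format are also hypothetical.

```python
import math
from collections import Counter, defaultdict


def majority_vote(samples):
    """Conventional majority vote: return the most frequent final answer.

    samples: list of (answer, cot_length) pairs from parallel sampling.
    """
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(answer for answer, _ in samples)
    return counts.most_common(1)[0][0]


def shortest_majority_vote(samples):
    """Length-aware vote (illustrative sketch, not the paper's exact rule).

    Answers are grouped, and each group is scored by how many samples
    agree on it, discounted by how long its CoTs are on average.
    """
    if not samples:
        raise ValueError("need at least one sample")

    lengths_by_answer = defaultdict(list)
    for answer, cot_length in samples:
        lengths_by_answer[answer].append(cot_length)

    def score(lengths):
        count = len(lengths)
        mean_length = sum(lengths) / count
        # ASSUMPTION: more votes raise the score, longer CoTs lower it.
        # Adding e keeps the logarithm >= 1 for any non-negative length.
        return count / math.log(mean_length + math.e)

    return max(lengths_by_answer, key=lambda a: score(lengths_by_answer[a]))


if __name__ == "__main__":
    # Two answers tie on votes; the one reached via shorter CoTs wins.
    samples = [("42", 900), ("42", 1100), ("17", 200), ("17", 250), ("7", 5000)]
    print(shortest_majority_vote(samples))
```

In this sketch a clear majority still wins, but when vote counts are close the answer supported by shorter CoTs is preferred, which mirrors the observation that correct solutions tend to be shorter than incorrect ones.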

