The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer

Abstract

Large language models have demonstrated remarkable progress in mathematical reasoning, leveraging chain-of-thought and test-time compute scaling. However, many open questions remain regarding the interplay between reasoning token usage and accuracy gains. In particular, when comparing models across generations, it is unclear whether improved performance results from longer reasoning chains or from more efficient reasoning. We systematically analyze chain-of-thought length across o1-mini and o3-mini variants on the Omni-MATH benchmark, finding that o3-mini (m) achieves superior accuracy without requiring longer reasoning chains than o1-mini. Moreover, we show that accuracy generally declines as reasoning chains grow across all models and compute settings, even when controlling for question difficulty. This accuracy drop is significantly smaller in more proficient models, suggesting that new generations of reasoning models use test-time compute more effectively. Finally, we highlight that while o3-mini (h) achieves a marginal accuracy gain over o3-mini (m), it does so by allocating substantially more reasoning tokens across all problems, including those that o3-mini (m) can already solve. These findings provide new insights into the relationship between model capability and reasoning length, with implications for efficiency, scaling, and evaluation methodologies.
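The length-versus-accuracy comparison described above can be illustrated with a minimal sketch. It shows one simple way to control for difficulty: stratify by difficulty tier, then compare accuracy on shorter and longer reasoning chains within each tier. This is not necessarily the authors' procedure. The record layout, the function name `accuracy_by_length`, and all values are assumptions made up for illustration, not results from the paper.

```python
from collections import defaultdict
from statistics import median

# Hypothetical toy records: (difficulty_tier, reasoning_tokens, is_correct).
# These values are invented for illustration; they are not data from the paper.
records = [
    (1, 400, True), (1, 650, True), (1, 900, True), (1, 2100, False),
    (2, 800, True), (2, 1500, True), (2, 3200, False), (2, 4100, False),
]

def accuracy_by_length(records):
    """Within each difficulty tier, compare accuracy on short vs. long chains.

    A chain counts as "short" if its token count is at or below the tier's
    median token count, and "long" otherwise.
    """
    by_tier = defaultdict(list)
    for tier, tokens, correct in records:
        by_tier[tier].append((tokens, correct))

    result = {}
    for tier, rows in sorted(by_tier.items()):
        cutoff = median(tokens for tokens, _ in rows)
        short = [correct for tokens, correct in rows if tokens <= cutoff]
        long_ = [correct for tokens, correct in rows if tokens > cutoff]
        result[tier] = {
            # Booleans sum as 0/1, so sum/len is the fraction correct.
            "short": sum(short) / len(short) if short else None,
            "long": sum(long_) / len(long_) if long_ else None,
        }
    return result

result = accuracy_by_length(records)
for tier, acc in result.items():
    print(f"tier {tier}: short={acc['short']}, long={acc['long']}")
```

Splitting at the per-tier median keeps the comparison within problems of similar difficulty, so a gap between the two groups is not simply an effect of harder problems eliciting longer chains. A regression with difficulty as a covariate would be a finer-grained alternative.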

