The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer
February 21, 2025
Authors: Marthe Ballon, Andres Algaba, Vincent Ginis
cs.AI
Abstract
Large language models have demonstrated remarkable progress in mathematical
reasoning, leveraging chain-of-thought and test-time compute scaling. However,
many open questions remain regarding the interplay between reasoning token
usage and accuracy gains. In particular, when comparing models across
generations, it is unclear whether improved performance results from longer
reasoning chains or more efficient reasoning. We systematically analyze
chain-of-thought length across o1-mini and o3-mini variants on the Omni-MATH
benchmark, finding that o3-mini (m) achieves superior accuracy without
requiring longer reasoning chains than o1-mini. Moreover, we show that accuracy
generally declines as reasoning chains grow across all models and compute
settings, even when controlling for difficulty of the questions. This accuracy
drop is significantly smaller in more proficient models, suggesting that new
generations of reasoning models use test-time compute more effectively.
Finally, we highlight that while o3-mini (h) achieves a marginal accuracy gain
over o3-mini (m), it does so by allocating substantially more reasoning tokens
across all problems, even the ones that o3-mini (m) can already solve. These
findings provide new insights into the relationship between model capability
and reasoning length, with implications for efficiency, scaling, and evaluation
methodologies.
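The core measurement the abstract describes — accuracy as a function of reasoning-chain length while controlling for question difficulty — can be sketched as a simple stratified binning of per-question results. The function, field names, and data below are illustrative assumptions, not the paper's actual pipeline.

```python
from collections import defaultdict

def accuracy_by_length(records, token_bins):
    """Compute accuracy per (difficulty tier, reasoning-token bin) cell.

    records: iterable of (difficulty, reasoning_tokens, correct) tuples.
    token_bins: list of (lo, hi) half-open intervals over token counts.
    Stratifying by difficulty controls for harder questions naturally
    eliciting longer chains, so length effects are compared within a tier.
    """
    cells = defaultdict(list)
    for difficulty, tokens, correct in records:
        for lo, hi in token_bins:
            if lo <= tokens < hi:
                cells[(difficulty, (lo, hi))].append(correct)
                break
    # Accuracy is the fraction of solved questions in each cell.
    return {cell: sum(flags) / len(flags) for cell, flags in cells.items()}

# Illustrative data: (difficulty tier, reasoning tokens used, solved?)
records = [
    ("easy", 500, True), ("easy", 4500, False),
    ("hard", 900, True), ("hard", 5200, False), ("hard", 5400, True),
]
bins = [(0, 1000), (1000, 8000)]
print(accuracy_by_length(records, bins))
```

Within each difficulty tier, comparing accuracy across token bins (and across models) is what distinguishes "thinking harder" — higher accuracy at comparable chain lengths — from "thinking longer."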