Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
December 30, 2024
Authors: Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu
cs.AI
Abstract
The remarkable performance of models like the OpenAI o1 can be attributed to
their ability to emulate human-like long-time thinking during inference. These
models employ extended chain-of-thought (CoT) processes, exploring multiple
strategies to enhance problem-solving capabilities. However, a critical
question remains: How to intelligently and efficiently scale computational
resources during testing. This paper presents the first comprehensive study on
the prevalent issue of overthinking in these models, where excessive
computational resources are allocated for simple problems with minimal benefit.
We introduce novel efficiency metrics from both outcome and process
perspectives to evaluate the rational use of computational resources by o1-like
models. Using a self-training paradigm, we propose strategies to mitigate
overthinking, streamlining reasoning processes without compromising accuracy.
Experimental results show that our approach successfully reduces computational
overhead while preserving model performance across a range of test sets with
varying difficulty levels, such as GSM8K, MATH500, GPQA, and AIME.
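To make the "outcome efficiency" idea concrete, here is a minimal sketch of one way such a metric could be computed: the fraction of generated tokens that were actually needed to reach the first correct answer, averaged over responses. The function name and exact formula are illustrative assumptions, not the paper's official definition.

```python
def outcome_efficiency(solutions):
    """Hypothetical outcome-efficiency metric (illustrative, not the paper's).

    solutions: list of (tokens_to_first_correct, total_tokens, is_correct).
    A correct response scores tokens-up-to-first-correct-solution divided by
    all tokens generated; an incorrect response contributes 0.0, since all of
    its tokens were spent without producing a correct outcome.
    """
    if not solutions:
        return 0.0
    scores = []
    for first_correct_tokens, total_tokens, is_correct in solutions:
        if is_correct and total_tokens > 0:
            scores.append(first_correct_tokens / total_tokens)
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)

# Example of the overthinking pattern the abstract describes: a model that
# answers "2+3=5" within 20 tokens but keeps double-checking for 180 more
# tokens is only 10% outcome-efficient on that problem.
print(outcome_efficiency([(20, 200, True)]))  # → 0.1
```

Under this framing, mitigating overthinking raises the metric toward 1.0 by trimming redundant reasoning after the first correct solution, without touching accuracy.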