Thinkless: LLM Learns When to Think
May 19, 2025
Authors: Gongfan Fang, Xinyin Ma, Xinchao Wang
cs.AI
Abstract
Reasoning Language Models, capable of extended chain-of-thought reasoning,
have demonstrated remarkable performance on tasks requiring complex logical
inference. However, applying elaborate reasoning for all queries often results
in substantial computational inefficiencies, particularly when many problems
admit straightforward solutions. This motivates an open question: Can LLMs
learn when to think? To answer this, we propose Thinkless, a learnable
framework that empowers an LLM to adaptively select between short-form and
long-form reasoning, based on both task complexity and the model's ability.
Thinkless is trained under a reinforcement learning paradigm and employs two
control tokens, <short> for concise responses and <think> for detailed
reasoning. At the core of our method is a Decoupled Group Relative Policy
Optimization (DeGRPO) algorithm, which decomposes the learning objective of
hybrid reasoning into two components: (1) a control token loss that governs the
selection of the reasoning mode, and (2) a response loss that improves the
accuracy of the generated answers. This decoupled formulation enables
fine-grained control over the contributions of each objective, stabilizing
training and effectively preventing collapse observed in vanilla GRPO.
Empirically, on several benchmarks such as Minerva Algebra, MATH-500, and
GSM8K, Thinkless is able to reduce the usage of long-chain thinking by
50%–90%, significantly improving the efficiency of Reasoning Language Models.
The code is available at https://github.com/VainF/Thinkless.
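To illustrate the control-token mechanism described above, here is a minimal inference sketch: the model emits <short> or <think> as its first generated token, and the rest of the response follows that mode. The model identifier is hypothetical (the released checkpoints are in the linked repository), and this is a reading of the abstract, not the authors' exact inference code.

```python
# Hypothetical use of the two control tokens at inference time.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "VainF/Thinkless"  # hypothetical id; see the GitHub repo for real checkpoints
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "What is 17 * 24?"
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=512)
text = tok.decode(out[0])

# The first generated token selects the reasoning mode:
# <short> -> concise answer, <think> -> long-form chain of thought.
mode = "long-form" if "<think>" in text else "short-form"
print(mode, text)
```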
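The decoupled objective at the core of DeGRPO can be sketched as follows. This is a minimal reading of the abstract only, not the paper's implementation: the group-relative advantage, the length normalization of the response term, and the weighting factor `alpha` are illustrative assumptions.

```python
# Minimal sketch of a DeGRPO-style decoupled loss, assuming a GRPO-like
# setup: for each prompt we sample a group of G rollouts, each starting
# with a control token (<short> or <think>) followed by the response.
import torch

def degrpo_loss(ctrl_logprobs, resp_logprobs, resp_lengths, rewards, alpha=0.1):
    """Decoupled policy-gradient loss for hybrid reasoning.

    ctrl_logprobs: (G,) log-prob of the sampled control token per rollout
    resp_logprobs: (G,) summed log-prob of the response tokens per rollout
    resp_lengths:  (G,) number of response tokens per rollout
    rewards:       (G,) scalar reward per rollout (e.g., answer correctness)
    alpha:         assumed weight balancing mode selection vs. accuracy
    """
    # Group-relative advantage, as in GRPO: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # (1) Control-token loss: governs the choice of reasoning mode.
    ctrl_loss = -(adv * ctrl_logprobs).mean()

    # (2) Response loss: improves answer accuracy; length-normalized here
    # so that long rollouts do not dominate the gradient.
    resp_loss = -(adv * resp_logprobs / resp_lengths.clamp(min=1)).mean()

    # Decoupling lets the two terms be weighted independently, which the
    # abstract reports stabilizes training relative to vanilla GRPO.
    return alpha * ctrl_loss + resp_loss

# Example with a group of 4 rollouts (toy numbers):
loss = degrpo_loss(
    ctrl_logprobs=torch.tensor([-0.7, -0.7, -0.4, -0.4]),
    resp_logprobs=torch.tensor([-35.0, -120.0, -40.0, -110.0]),
    resp_lengths=torch.tensor([30, 400, 35, 380]),
    rewards=torch.tensor([1.0, 1.0, 0.0, 1.0]),
)
```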