
Thinkless: LLM Learns When to Think

May 19, 2025
作者: Gongfan Fang, Xinyin Ma, Xinchao Wang
cs.AI

Abstract
Reasoning Language Models, capable of extended chain-of-thought reasoning, have demonstrated remarkable performance on tasks requiring complex logical inference. However, applying elaborate reasoning to all queries often results in substantial computational inefficiency, particularly when many problems admit straightforward solutions. This motivates an open question: Can LLMs learn when to think? To answer this, we propose Thinkless, a learnable framework that empowers an LLM to adaptively select between short-form and long-form reasoning, based on both task complexity and the model's ability. Thinkless is trained under a reinforcement learning paradigm and employs two control tokens, <short> for concise responses and <think> for detailed reasoning. At the core of our method is a Decoupled Group Relative Policy Optimization (DeGRPO) algorithm, which decomposes the learning objective of hybrid reasoning into two components: (1) a control token loss that governs the selection of the reasoning mode, and (2) a response loss that improves the accuracy of the generated answers. This decoupled formulation enables fine-grained control over the contribution of each objective, stabilizing training and effectively preventing the collapse observed in vanilla GRPO. Empirically, on several benchmarks such as Minerva Algebra, MATH-500, and GSM8K, Thinkless reduces the usage of long-chain thinking by 50%–90%, significantly improving the efficiency of Reasoning Language Models. The code is available at https://github.com/VainF/Thinkless.
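The decoupled objective described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the normalization details, and the weights `alpha`/`beta` are assumptions introduced here to show how a GRPO-style group-relative advantage can be applied with separate terms for the mode-selection token and the response tokens.

```python
def group_relative_advantages(rewards):
    """GRPO-style advantage: each rollout's reward relative to its group,
    normalized by the group's standard deviation."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

def degrpo_loss(control_logprob, response_logprobs, advantage,
                alpha=1.0, beta=1.0):
    """Decoupled policy-gradient loss for one rollout.

    control_logprob: log-probability of the chosen mode token
        (e.g. <short> or <think>).
    response_logprobs: per-token log-probabilities of the answer.
    alpha, beta: illustrative weights allowing fine-grained control
        over the mode-selection and answer-accuracy objectives.
    """
    control_term = -advantage * control_logprob
    response_term = -advantage * sum(response_logprobs) / len(response_logprobs)
    return alpha * control_term + beta * response_term
```

Separating the two terms means the rare control token is no longer drowned out by the many response tokens sharing one averaged loss, which is one plausible reading of why the decoupling stabilizes training.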

