
Thinkless: LLM Learns When to Think

May 19, 2025
作者: Gongfan Fang, Xinyin Ma, Xinchao Wang
cs.AI

Abstract

Reasoning Language Models, capable of extended chain-of-thought reasoning, have demonstrated remarkable performance on tasks requiring complex logical inference. However, applying elaborate reasoning to all queries often results in substantial computational inefficiency, particularly when many problems admit straightforward solutions. This motivates an open question: can LLMs learn when to think? To answer this, we propose Thinkless, a learnable framework that empowers an LLM to adaptively select between short-form and long-form reasoning, based on both task complexity and the model's ability. Thinkless is trained under a reinforcement learning paradigm and employs two control tokens: <short> for concise responses and <think> for detailed reasoning. At the core of our method is a Decoupled Group Relative Policy Optimization (DeGRPO) algorithm, which decomposes the learning objective of hybrid reasoning into two components: (1) a control token loss that governs the selection of the reasoning mode, and (2) a response loss that improves the accuracy of the generated answers. This decoupled formulation enables fine-grained control over the contribution of each objective, stabilizing training and effectively preventing the collapse observed in vanilla GRPO. Empirically, on several benchmarks such as Minerva Algebra, MATH-500, and GSM8K, Thinkless reduces the usage of long-chain thinking by 50%-90%, significantly improving the efficiency of Reasoning Language Models. The code is available at https://github.com/VainF/Thinkless.
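The decoupled objective described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation (see the linked repository for that): all function names, the `alpha` weighting knob, and the exact normalization are assumptions. The key idea it demonstrates is that the log-probability of the single <short>/<think> control token gets its own loss term, separately weighted from the (length-normalized) loss over the response tokens, so a long response cannot drown out the mode-selection signal.

```python
# Hedged sketch of a decoupled GRPO-style objective.
# All names (group_relative_advantages, degrpo_loss, alpha) are
# illustrative; the actual Thinkless code may differ.
import math

def group_relative_advantages(rewards):
    """Normalize rewards within a sampled group (the standard GRPO baseline)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8  # epsilon guards against zero variance
    return [(r - mean) / std for r in rewards]

def degrpo_loss(mode_logps, response_logps, rewards, alpha=1.0):
    """
    mode_logps:     log-prob of the chosen <short>/<think> token, per sample
    response_logps: per-token log-probs of each sampled response
    rewards:        scalar reward per sample (e.g. answer correctness)
    alpha:          relative weight on the control-token term (assumed knob)
    """
    advs = group_relative_advantages(rewards)
    control_loss = 0.0
    response_loss = 0.0
    for a, m_lp, r_lps in zip(advs, mode_logps, response_logps):
        # Mode-selection term: gradient flows through the control token alone.
        control_loss += -a * m_lp
        # Answer term: length-normalized so long <think> traces don't dominate.
        response_loss += -a * (sum(r_lps) / len(r_lps))
    n = len(rewards)
    return alpha * control_loss / n + response_loss / n
```

Keeping the two terms separate is what allows rebalancing them: in an undecoupled objective, the one control token contributes a vanishing fraction of the gradient relative to hundreds of response tokens, which is the kind of imbalance the abstract associates with collapse in vanilla GRPO.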

