Thinkless: LLM Learns When to Think

Abstract

Reasoning Language Models, capable of extended chain-of-thought reasoning, have demonstrated remarkable performance on tasks requiring complex logical inference. However, applying elaborate reasoning to every query is often computationally wasteful, because many problems admit straightforward solutions. This motivates an open question: can LLMs learn when to think? To answer it, we propose Thinkless, a learnable framework that enables an LLM to adaptively select between short-form and long-form reasoning, based on both task complexity and the model's own ability. Thinkless is trained under a reinforcement learning paradigm and employs two control tokens: <short> for concise responses and <think> for detailed reasoning. At the core of our method is the Decoupled Group Relative Policy Optimization (DeGRPO) algorithm, which decomposes the learning objective of hybrid reasoning into two components: (1) a control token loss that governs the selection of the reasoning mode, and (2) a response loss that improves the accuracy of the generated answers. This decoupled formulation enables fine-grained control over the contribution of each objective, stabilizing training and effectively preventing the collapse observed in vanilla GRPO. Empirically, on several benchmarks such as Minerva Algebra, MATH-500, and GSM8K, Thinkless reduces the usage of long-chain thinking by 50% to 90%, significantly improving the efficiency of Reasoning Language Models. The code is available at https://github.com/VainF/Thinkless.
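
The two-part objective described above can be illustrated with a minimal sketch. The abstract does not give the exact DeGRPO formula, so the code below is only one plausible reading: a plain policy-gradient surrogate in which the first token of each sampled response is treated as the control token (<short> or <think>) and weighted separately from the remaining tokens. The function names, the `alpha` weight, and the length-averaged response term are all assumptions made for illustration, and the clipping, importance ratios, and KL terms used in GRPO-style training are omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of samples for the same query."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def decoupled_loss(token_logprobs, advantage, alpha=1.0):
    """Return (control_loss, response_loss) for one sampled response.

    token_logprobs[0] is assumed to be the log-probability of the control
    token; the rest belong to the answer tokens. `alpha` is a hypothetical
    weight on the control-token term.
    """
    control_logprob, *response_logprobs = token_logprobs
    # Mode selection: a single token, scaled by its own coefficient.
    control_loss = -alpha * advantage * control_logprob
    # Answer quality: averaged over response tokens so that long responses
    # do not dominate purely because of their length.
    if response_logprobs:
        response_loss = -advantage * mean(response_logprobs)
    else:
        response_loss = 0.0
    return control_loss, response_loss


if __name__ == "__main__":
    # Two samples for the same query: the first is correct, the second is not.
    advantages = group_relative_advantages([1.0, 0.0])
    for logprobs, adv in zip([[-0.5, -1.0, -3.0], [-0.7, -2.0]], advantages):
        print(decoupled_loss(logprobs, adv, alpha=0.1))
```

Keeping the two terms separate is what allows their relative weight to be tuned: in this sketch, changing `alpha` alters how strongly the mode choice is updated without touching the gradient signal for the answer tokens.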
