Thinkless: LLM Learns When to Think

Abstract

Reasoning Language Models, capable of extended chain-of-thought reasoning, have demonstrated remarkable performance on tasks requiring complex logical inference. However, applying elaborate reasoning to every query often wastes substantial computation, particularly when many problems admit straightforward solutions. This motivates an open question: can LLMs learn when to think? To answer it, we propose Thinkless, a learnable framework that enables an LLM to select adaptively between short-form and long-form reasoning, based on both task complexity and the model's own ability. Thinkless is trained under a reinforcement learning paradigm and employs two control tokens: <short> for concise responses and <think> for detailed reasoning. At the core of our method is a Decoupled Group Relative Policy Optimization (DeGRPO) algorithm, which decomposes the learning objective of hybrid reasoning into two components: (1) a control token loss that governs the selection of the reasoning mode, and (2) a response loss that improves the accuracy of the generated answers. This decoupled formulation enables fine-grained control over the contribution of each objective, stabilizing training and effectively preventing the collapse observed in vanilla GRPO. Empirically, on several benchmarks such as Minerva Algebra, MATH-500, and GSM8K, Thinkless reduces the usage of long-chain thinking by 50% to 90%, significantly improving the efficiency of Reasoning Language Models. The code is available at https://github.com/VainF/Thinkless.
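The abstract names the two components of the DeGRPO objective but does not give its formula. The following is only a minimal sketch of what a decoupled objective of this shape could look like, under stated assumptions: rewards are standardized within a group of sampled responses (as in GRPO), the control-token term and the response term are computed separately, and a hypothetical weight `alpha` scales the control-token term. The function names, the dictionary fields, and `alpha` are illustrative and are not taken from the paper or its repository; the plain policy-gradient surrogate used here omits the importance ratios and clipping that a real implementation would likely include.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards):
    """Standardize rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Small epsilon avoids division by zero when all rewards are equal.
    return [(r - mu) / (sigma + 1e-6) for r in rewards]


def decoupled_loss(samples, alpha=1.0):
    """Return (total, control_loss, response_loss) for one group.

    Each sample is a dict with (hypothetical) fields:
      reward        -- scalar reward for the whole response
      ctrl_logprob  -- log-probability of the chosen control token
      resp_logprobs -- log-probabilities of the response tokens
    """
    advantages = group_relative_advantages([s["reward"] for s in samples])
    ctrl_terms, resp_terms = [], []
    for sample, adv in zip(samples, advantages):
        # Mode selection: one term per sample, independent of response length.
        ctrl_terms.append(-adv * sample["ctrl_logprob"])
        # Answer quality: averaged over this sample's response tokens only.
        resp_terms.append(-adv * mean(sample["resp_logprobs"]))
    control_loss = mean(ctrl_terms)
    response_loss = mean(resp_terms)
    # alpha (assumed) sets how strongly mode selection is updated.
    return alpha * control_loss + response_loss, control_loss, response_loss
```

Keeping the two terms separate means the single control token is not averaged together with the response tokens, so its weight in the update can be set explicitly through `alpha` rather than depending on how long each response happens to be.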
