AdaCoT: Pareto-Optimal Adaptive Chain-of-Thought Triggering via Reinforcement Learning

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities but often face challenges with tasks requiring sophisticated reasoning. While Chain-of-Thought (CoT) prompting significantly enhances reasoning, it indiscriminately generates lengthy reasoning steps for all queries, leading to substantial computational costs and inefficiency, especially for simpler inputs. To address this critical issue, we introduce AdaCoT (Adaptive Chain-of-Thought), a novel framework enabling LLMs to adaptively decide when to invoke CoT. AdaCoT frames adaptive reasoning as a Pareto optimization problem that seeks to balance model performance with the costs associated with CoT invocation (both frequency and computational overhead). We propose a reinforcement learning (RL)-based method, specifically utilizing Proximal Policy Optimization (PPO), to dynamically control the CoT triggering decision boundary by adjusting penalty coefficients, thereby allowing the model to determine CoT necessity based on implicit query complexity. A key technical contribution is Selective Loss Masking (SLM), designed to counteract decision boundary collapse during multi-stage RL training, ensuring robust and stable adaptive triggering. Experimental results demonstrate that AdaCoT successfully navigates the Pareto frontier, achieving substantial reductions in CoT usage for queries not requiring elaborate reasoning. For instance, on our production traffic test set, AdaCoT reduced CoT triggering rates to as low as 3.18% and decreased average response tokens by 69.06%, while maintaining high performance on complex tasks.
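To illustrate how a penalty coefficient can move a CoT triggering decision boundary, the following is a minimal sketch and not the reward used in the paper, whose exact form the abstract does not give. It assumes a hypothetical reward of the form quality minus a fixed penalty whenever CoT is used; the names `Candidate`, `penalized_reward`, `preferred`, and all numeric values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One possible response to a query (hypothetical structure)."""
    uses_cot: bool   # whether the response triggered chain-of-thought
    quality: float   # assumed task-quality score in [0, 1]


def penalized_reward(c: Candidate, cot_penalty: float) -> float:
    # Assumed reward shape: quality minus a flat cost for invoking CoT.
    return c.quality - (cot_penalty if c.uses_cot else 0.0)


def preferred(direct: Candidate, with_cot: Candidate, cot_penalty: float) -> Candidate:
    # Return whichever response earns the higher penalized reward.
    if penalized_reward(with_cot, cot_penalty) > penalized_reward(direct, cot_penalty):
        return with_cot
    return direct


# Illustrative numbers only: CoT barely helps on the easy query
# and helps a lot on the hard one.
easy = (Candidate(False, 0.90), Candidate(True, 0.92))
hard = (Candidate(False, 0.30), Candidate(True, 0.85))

for penalty in (0.0, 0.1, 0.6):
    easy_choice = preferred(*easy, cot_penalty=penalty).uses_cot
    hard_choice = preferred(*hard, cot_penalty=penalty).uses_cot
    print(f"penalty={penalty}: easy uses CoT={easy_choice}, hard uses CoT={hard_choice}")
```

With these made-up scores, a zero penalty favors CoT for both queries, a moderate penalty keeps CoT only for the hard query, and a large penalty suppresses it everywhere. Sweeping the coefficient in this way traces different trade-off points between quality and CoT usage, which is the intuition behind treating the problem as Pareto optimization.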
