AdaCoT: Pareto-Optimal Adaptive Chain-of-Thought Triggering via Reinforcement Learning
May 17, 2025
作者: Chenwei Lou, Zewei Sun, Xinnian Liang, Meng Qu, Wei Shen, Wenqi Wang, Yuntao Li, Qingping Yang, Shuangzhi Wu
cs.AI
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities but
often face challenges with tasks requiring sophisticated reasoning. While
Chain-of-Thought (CoT) prompting significantly enhances reasoning, it
indiscriminately generates lengthy reasoning steps for all queries, leading to
substantial computational costs and inefficiency, especially for simpler
inputs. To address this critical issue, we introduce AdaCoT (Adaptive
Chain-of-Thought), a novel framework enabling LLMs to adaptively decide when to
invoke CoT. AdaCoT frames adaptive reasoning as a Pareto optimization problem
that seeks to balance model performance with the costs associated with CoT
invocation (both frequency and computational overhead). We propose a
reinforcement learning (RL) based method, specifically utilizing Proximal
Policy Optimization (PPO), to dynamically control the CoT triggering decision
boundary by adjusting penalty coefficients, thereby allowing the model to
determine CoT necessity based on implicit query complexity. A key technical
contribution is Selective Loss Masking (SLM), designed to counteract decision
boundary collapse during multi-stage RL training, ensuring robust and stable
adaptive triggering. Experimental results demonstrate that AdaCoT successfully
navigates the Pareto frontier, achieving substantial reductions in CoT usage
for queries not requiring elaborate reasoning. For instance, on our production
traffic testset, AdaCoT reduced CoT triggering rates to as low as 3.18% and
decreased average response tokens by 69.06%, while maintaining high performance
on complex tasks.
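The abstract's penalty-coefficient idea can be illustrated with a toy reward-shaping function. This is a minimal sketch, not the paper's formulation: the names `alpha` (invocation-frequency penalty) and `beta` (per-token length penalty), their values, and the linear form are all assumptions for illustration.

```python
def shaped_reward(task_reward, used_cot, cot_tokens, alpha=0.3, beta=0.001):
    """Toy Pareto-style reward shaping for adaptive CoT triggering.

    Subtracts a penalty when CoT is invoked, combining a fixed cost for
    triggering (alpha) and a cost proportional to reasoning length (beta).
    Raising alpha/beta pushes the policy's decision boundary toward
    invoking CoT only on harder queries; lowering them favors accuracy.
    All coefficients here are illustrative, not the paper's values.
    """
    penalty = (alpha + beta * cot_tokens) if used_cot else 0.0
    return task_reward - penalty
```

In an actual PPO setup this shaped scalar would replace the raw task reward when computing advantages, so the policy is trained to weigh CoT's accuracy benefit against its cost per query.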
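Selective Loss Masking (SLM) is described here only at a high level, as a way to prevent the triggering decision boundary from collapsing during later RL stages. A minimal sketch of one plausible reading, assuming the trigger decision is carried by dedicated tokens whose training loss is zeroed out (the mask convention and list-based representation are assumptions, not the paper's implementation):

```python
def selective_loss_mask(token_losses, decision_mask):
    """Zero out per-token losses at CoT-trigger decision positions.

    token_losses:  per-token training losses for one sequence.
    decision_mask: 1 where a token encodes the CoT-trigger decision,
                   0 elsewhere. Masked positions contribute no gradient,
                   so stages trained on skewed data cannot overwrite the
                   previously learned triggering boundary.
    """
    return [loss * (0.0 if masked else 1.0)
            for loss, masked in zip(token_losses, decision_mask)]
```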