AdaCoT: Pareto-Optimal Adaptive Chain-of-Thought Triggering via Reinforcement Learning

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities but often struggle with tasks that require sophisticated reasoning. Chain-of-Thought (CoT) prompting significantly improves reasoning, but it generates lengthy reasoning steps for every query indiscriminately, which incurs substantial computational cost and inefficiency, especially on simple inputs. To address this issue, we introduce AdaCoT (Adaptive Chain-of-Thought), a novel framework that enables LLMs to decide adaptively when to invoke CoT. AdaCoT frames adaptive reasoning as a Pareto optimization problem that balances model performance against the costs of CoT invocation (both its frequency and its computational overhead). We propose a reinforcement learning (RL)-based method, specifically using Proximal Policy Optimization (PPO), that dynamically controls the CoT triggering decision boundary by adjusting penalty coefficients, allowing the model to judge whether CoT is necessary from the implicit complexity of the query. A key technical contribution is Selective Loss Masking (SLM), designed to counteract decision boundary collapse during multi-stage RL training and thereby ensure robust, stable adaptive triggering. Experimental results show that AdaCoT successfully navigates the Pareto frontier and substantially reduces CoT usage on queries that do not require elaborate reasoning. For instance, on our production traffic test set, AdaCoT reduced the CoT triggering rate to as low as 3.18% and decreased average response tokens by 69.06%, while maintaining high performance on complex tasks.
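
The trade-off described in the abstract can be illustrated with a small sketch. It is not the paper's reward function or training code: the additive penalty form, the `Rollout` type, and the names `penalized_reward`, `prefers_cot`, and `pareto_front` are all hypothetical, chosen only to show how a penalty coefficient can shift the point at which CoT pays off, and how candidate operating points (CoT trigger rate, task score) can be filtered down to a Pareto front.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    quality: float   # task score of the response (hypothetical scale, e.g. 0..1)
    used_cot: bool   # whether the response invoked chain-of-thought


def penalized_reward(rollout: Rollout, cot_penalty: float) -> float:
    """Quality minus a fixed cost whenever CoT is triggered (assumed form)."""
    return rollout.quality - (cot_penalty if rollout.used_cot else 0.0)


def prefers_cot(quality_with_cot: float, quality_without_cot: float,
                cot_penalty: float) -> bool:
    """True if the penalized reward is higher with CoT than without it."""
    with_cot = penalized_reward(Rollout(quality_with_cot, True), cot_penalty)
    without = penalized_reward(Rollout(quality_without_cot, False), cot_penalty)
    return with_cot > without


def pareto_front(points):
    """Keep (trigger_rate, score) points that no other point dominates.

    Lower trigger rate and higher score are both considered better.
    """
    front = []
    for rate, score in points:
        dominated = any(
            (r <= rate and s >= score) and (r < rate or s > score)
            for r, s in points
        )
        if not dominated:
            front.append((rate, score))
    return sorted(front)
```

Under these assumptions, a larger `cot_penalty` means CoT is preferred only when it improves quality by a wider margin, so sweeping the coefficient yields different operating points along the performance-versus-cost curve.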
