AdaCoT: Pareto-Optimal Adaptive Chain-of-Thought Triggering via Reinforcement Learning

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities but often struggle with tasks that require sophisticated reasoning. Chain-of-Thought (CoT) prompting significantly enhances reasoning, but it generates lengthy reasoning steps for every query indiscriminately, which incurs substantial computational cost and inefficiency, especially on simple inputs. To address this issue, we introduce AdaCoT (Adaptive Chain-of-Thought), a novel framework that enables LLMs to decide adaptively when to invoke CoT. AdaCoT frames adaptive reasoning as a Pareto optimization problem that balances model performance against the costs of CoT invocation (both its frequency and its computational overhead). We propose a reinforcement learning (RL) based method, specifically using Proximal Policy Optimization (PPO), that dynamically controls the CoT triggering decision boundary by adjusting penalty coefficients, so that the model determines whether CoT is necessary from the implicit complexity of the query. A key technical contribution is Selective Loss Masking (SLM), designed to counteract decision boundary collapse during multi-stage RL training and thereby keep adaptive triggering robust and stable. Experimental results demonstrate that AdaCoT successfully navigates the Pareto frontier, substantially reducing CoT usage on queries that do not require elaborate reasoning. For instance, on our production traffic test set, AdaCoT reduced the CoT triggering rate to as low as 3.18% and decreased average response tokens by 69.06%, while maintaining high performance on complex tasks.
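The trade-off described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the reward form (a base reward minus a fixed penalty whenever CoT is triggered), the function names `penalized_reward` and `pareto_front`, and all numbers are assumptions made purely for illustration.

```python
from typing import List, Tuple


def penalized_reward(base_reward: float, used_cot: bool, penalty: float) -> float:
    """Assumed reward shape: task reward minus a fixed cost when CoT is triggered."""
    return base_reward - (penalty if used_cot else 0.0)


def pareto_front(points: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Return the non-dominated (cot_rate, score) points.

    Lower cot_rate is better and higher score is better. Each point would
    correspond to one model trained with a different penalty coefficient.
    """
    front: List[Tuple[float, float]] = []
    # Sort by ascending CoT rate; break ties by descending score.
    for rate, score in sorted(points, key=lambda p: (p[0], -p[1])):
        # Keep a point only if it beats every cheaper point on score.
        if not front or score > front[-1][1]:
            front.append((rate, score))
    return front


# Hypothetical hard query: CoT lifts the base reward from 0.2 to 0.9.
hard_with_cot = penalized_reward(0.9, used_cot=True, penalty=0.3)
hard_without_cot = penalized_reward(0.2, used_cot=False, penalty=0.3)

# Hypothetical easy query: the answer is equally good either way.
easy_with_cot = penalized_reward(0.9, used_cot=True, penalty=0.3)
easy_without_cot = penalized_reward(0.9, used_cot=False, penalty=0.3)

# Hypothetical (cot_rate, score) pairs, not results from the paper.
candidates = [(0.1, 0.5), (0.2, 0.4), (0.5, 0.8), (0.5, 0.7), (1.0, 0.8)]
front = pareto_front(candidates)
```

Under these assumed numbers, triggering CoT still pays off on the hard query but not on the easy one, which is the behavior a penalty coefficient is meant to induce. Raising or lowering the penalty shifts where that boundary falls, and sweeping it would trace out different points on the frontier.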


