Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning

Abstract

Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but they often spend tokens inefficiently and offer little control at inference time. Existing efficient-reasoning methods control thinking length by shortening traces, stopping early, or compressing traces, which leaves how the model thinks implicit. We propose Agentic Chain-of-Thought Steering (ACTS), which formulates reasoning steering as a Markov decision process in which a controller agent adaptively steers a frozen reasoner during inference. At each step, the controller observes the reasoning trace and the remaining thinking budget, then issues a steering action consisting of a reasoning strategy and a steering phrase that initiates the next reasoner step. This enables budget-aware strategy control for efficient reasoning while preserving the reasoner's generation continuity. We initialize the controller from synthetic steering trajectories that we construct with multi-budget augmentation, and further optimize it via reinforcement learning with budget-conditioned reward shaping. Experiments across multiple benchmarks show that ACTS matches full-thinking performance with substantial token savings and enables controllable accuracy-efficiency trade-offs across different reasoners and tasks. The code is available at https://github.com/Andree-9/ACTS.
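The control loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation (see the repository for that). Every name here is hypothetical: `SteeringAction`, `steer`, the toy controller and reasoner, and the strategy labels "decompose", "verify", and "conclude". Whitespace word counting stands in for a real tokenizer, and stopping when the controller picks "conclude" is an assumed termination rule.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class SteeringAction:
    strategy: str  # hypothetical strategy label, e.g. "verify"
    phrase: str    # steering phrase that opens the next reasoner step


# Controller policy: (trace so far, remaining budget) -> steering action.
Controller = Callable[[List[str], int], SteeringAction]
# Frozen reasoner: (question, trace, steering phrase, max tokens) -> continuation.
Reasoner = Callable[[str, List[str], str, int], str]


def steer(question: str, controller: Controller, reasoner: Reasoner,
          budget: int, step_cap: int = 32) -> Tuple[List[str], List[str], int]:
    """Roll out one steering episode; returns (trace, strategies, tokens used)."""
    trace: List[str] = []
    strategies: List[str] = []
    remaining = budget
    while remaining > 0:
        # The controller observes the trace and the remaining thinking budget.
        action = controller(trace, remaining)
        limit = min(step_cap, remaining)
        # The reasoner continues from the steering phrase; its weights are untouched.
        words = reasoner(question, trace, action.phrase, limit).split()[:limit]
        trace.append((action.phrase + " " + " ".join(words)).strip())
        strategies.append(action.strategy)
        # Charge at least one token per step so the loop always terminates.
        remaining -= max(1, len(words))
        if action.strategy == "conclude":  # assumed stopping rule
            break
    return trace, strategies, budget - remaining


def toy_controller(trace: List[str], remaining: int) -> SteeringAction:
    """Hand-written stand-in for a learned controller policy."""
    if remaining <= 8:
        return SteeringAction("conclude", "So the final answer is")
    if not trace:
        return SteeringAction("decompose", "First, let me break this down.")
    return SteeringAction("verify", "Let me double-check this step.")


def toy_reasoner(question: str, trace: List[str], phrase: str,
                 max_tokens: int) -> str:
    """Stand-in for a frozen LLM: emits up to five placeholder tokens."""
    return " ".join(["tok"] * min(5, max_tokens))


if __name__ == "__main__":
    for b in (6, 20):
        _, strategies, used = steer("toy question", toy_controller,
                                    toy_reasoner, budget=b)
        print(b, strategies, used)
```

In this toy setup, a smaller budget pushes the controller to conclude sooner, which mirrors in a very simplified way the budget-aware trade-off the abstract describes. A learned controller would replace `toy_controller`, and an actual LLM call would replace `toy_reasoner`.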