Agentic Chain-of-Thought Steering for Efficient and Controllable Large Language Model Reasoning

Abstract

Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but they often spend tokens inefficiently and offer little control over the reasoning process at inference time. Existing efficient reasoning methods control thinking length by shortening, early-stopping, or compressing reasoning traces, but they do not explicitly control how the model thinks. We propose Agentic Chain-of-Thought Steering (ACTS), which formulates reasoning steering as a Markov decision process in which a controller agent adaptively steers a frozen reasoner during inference. At each step, the controller observes the reasoning trace and the remaining thinking budget, then issues a steering action consisting of a reasoning strategy and a steering phrase; the phrase initiates the reasoner's next step. This enables budget-aware strategy control for efficient reasoning while preserving the continuity of the reasoner's generation. We initialize the controller agent from synthetic steering trajectories that we construct with multi-budget augmentation, and further optimize it via reinforcement learning with budget-conditioned reward shaping. Experiments across multiple benchmarks show that ACTS matches full-thinking performance with substantial token savings and enables controllable accuracy-efficiency trade-offs across different reasoners and tasks. The code is available at https://github.com/Andree-9/ACTS.
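The steering loop described in the abstract can be illustrated with a minimal Python sketch. It is not the released ACTS implementation: the names `State`, `SteeringAction`, `steer`, `toy_controller`, and `toy_reasoner`, the strategy labels "explore" and "conclude", the whitespace-based token count, and the stopping rule are all assumptions made for illustration. The sketch only shows the control flow in which a controller observes the trace and remaining budget, emits a strategy and a steering phrase, and the frozen reasoner continues from that phrase.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class SteeringAction:
    strategy: str  # hypothetical strategy label, e.g. "explore" or "conclude"
    phrase: str    # steering phrase that opens the next reasoner step


@dataclass
class State:
    question: str
    trace: List[str] = field(default_factory=list)  # reasoning steps so far
    remaining_budget: int = 0                        # thinking tokens left


# Hypothetical interfaces: a controller maps a state to an action, and a
# frozen reasoner continues the text after the steering phrase.
Controller = Callable[[State], SteeringAction]
Reasoner = Callable[[str, List[str], str], str]


def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting stands in for a real tokenizer.
    return len(text.split())


def steer(question: str, controller: Controller, reasoner: Reasoner,
          budget: int, max_steps: int = 16) -> Tuple[List[str], List[SteeringAction]]:
    """Run one steered episode and return the trace and the actions taken."""
    state = State(question=question, remaining_budget=budget)
    actions: List[SteeringAction] = []
    for _ in range(max_steps):
        if state.remaining_budget <= 0:
            break
        action = controller(state)  # observe trace + remaining budget
        actions.append(action)
        continuation = reasoner(question, state.trace, action.phrase)
        step = action.phrase + continuation  # the phrase starts the step
        state.trace.append(step)
        state.remaining_budget -= count_tokens(step)
        if action.strategy == "conclude":  # assumed stopping rule
            break
    return state.trace, actions


def toy_controller(state: State) -> SteeringAction:
    # Toy policy: wrap up once the remaining budget is small.
    if state.remaining_budget <= 8:
        return SteeringAction("conclude", "Therefore, the answer is")
    return SteeringAction("explore", "Let me work through this:")


def toy_reasoner(question: str, trace: List[str], phrase: str) -> str:
    # Stand-in for a frozen language model continuing after the phrase.
    return f" (reasoning step {len(trace) + 1})"


if __name__ == "__main__":
    for budget in (20, 8):
        trace, actions = steer("What is 2 + 2?", toy_controller, toy_reasoner, budget)
        print(budget, [a.strategy for a in actions])
```

In this toy setup a smaller budget leads the controller to conclude earlier, which is the kind of budget-conditioned behavior the abstract attributes to the trained controller. A learned controller would replace `toy_controller`, and a real language model would replace `toy_reasoner`.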