Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning

Abstract

Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but often spend tokens inefficiently and offer little inference-time control. Existing efficient reasoning methods control thinking length by shortening, early-stopping, or compressing traces, leaving how the model thinks implicit. In this paper, we propose Agentic Chain-of-Thought Steering (ACTS), which formulates reasoning steering as a Markov decision process where a controller agent adaptively steers a frozen reasoner during inference. At each step, the controller observes the reasoning trace and remaining thinking budget, then issues a steering action consisting of a reasoning strategy and a steering phrase that initiates the next reasoner step. This enables budget-aware strategy control for efficient reasoning while preserving the reasoner's generation continuity. We initialize the controller agent from our constructed synthetic steering trajectories with multi-budget augmentation, and further optimize it via reinforcement learning with budget-conditioned reward shaping. Experiments across multiple benchmarks show that ACTS matches full-thinking performance with substantial token savings, and enables controllable accuracy-efficiency trade-offs across different reasoners and tasks. The code is available at https://github.com/Andree-9/ACTS.
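
The steering loop described above can be illustrated with a minimal sketch. This is not the released ACTS implementation: the names `SteeringAction`, `State`, `steer`, `rule_controller`, and `toy_reasoner`, the strategy labels, and the stopping rule are all assumptions made for illustration. The learned controller and the frozen reasoner are replaced by toy stand-ins so that the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class SteeringAction:
    strategy: str  # hypothetical strategy label, e.g. "decompose", "verify", "conclude"
    phrase: str    # steering phrase that opens the next reasoner step


@dataclass
class State:
    question: str
    trace: List[str] = field(default_factory=list)  # reasoning steps so far
    budget_left: int = 0                             # remaining thinking tokens


# The controller maps a state to an action; the reasoner continues from a phrase
# and reports how many tokens it consumed. Both signatures are assumptions.
Controller = Callable[[State], SteeringAction]
Reasoner = Callable[[State, str, int], Tuple[str, int]]


def steer(question: str, budget: int, controller: Controller,
          reasoner: Reasoner, max_steps: int = 16) -> State:
    """Roll out one steering episode: observe, act, let the reasoner continue."""
    state = State(question=question, budget_left=budget)
    for _ in range(max_steps):
        if state.budget_left <= 0:
            break
        action = controller(state)  # sees the trace and the remaining budget
        text, used = reasoner(state, action.phrase, state.budget_left)
        state.trace.append(action.phrase + text)
        state.budget_left -= used
        if action.strategy == "conclude":  # assumed terminal strategy
            break
    return state


def rule_controller(state: State) -> SteeringAction:
    """Toy stand-in for the learned controller: a fixed budget-aware rule."""
    if state.budget_left <= 20:
        return SteeringAction("conclude", "Therefore, the final answer is")
    if not state.trace:
        return SteeringAction("decompose", "Let me break this down.")
    return SteeringAction("verify", "Let me double-check.")


def toy_reasoner(state: State, phrase: str, max_tokens: int) -> Tuple[str, int]:
    """Toy stand-in for the frozen reasoner: fixed text, at most 10 tokens per step."""
    return " [continuation]", min(10, max_tokens)


final = steer("What is 2 + 2?", budget=35,
              controller=rule_controller, reasoner=toy_reasoner)
for step in final.trace:
    print(step)
```

In this sketch the reasoner is never modified; the controller only chooses how each step begins, which is one way to read "preserving the reasoner's generation continuity". A trained controller would take the place of `rule_controller`, and the budget threshold of 20 is arbitrary.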