Effectively Controlling Reasoning Models through Thinking Intervention
March 31, 2025
Authors: Tong Wu, Chong Xiang, Jiachen T. Wang, Prateek Mittal
cs.AI
Abstract
Reasoning-enhanced large language models (LLMs) explicitly generate
intermediate reasoning steps prior to generating final answers, helping the
model excel in complex problem-solving. In this paper, we demonstrate that this
emerging generation framework offers a unique opportunity for more fine-grained
control over model behavior. We propose Thinking Intervention, a novel paradigm
designed to explicitly guide the internal reasoning processes of LLMs by
strategically inserting or revising specific thinking tokens. We conduct
comprehensive evaluations across multiple tasks, including instruction
following on IFEval, instruction hierarchy on SEP, and safety alignment on
XSTest and SORRY-Bench. Our results demonstrate that Thinking Intervention
significantly outperforms baseline prompting approaches, achieving up to 6.7%
accuracy gains in instruction-following scenarios, 15.4% improvements in
reasoning about instruction hierarchies, and a 40.0% increase in refusal rates
for unsafe prompts using open-source DeepSeek R1 models. Overall, our work
opens a promising new research avenue for controlling reasoning LLMs.
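
As a concrete illustration of the mechanism the abstract describes, the sketch below prepends a guiding sentence inside a reasoning model's thinking block so that decoding continues from the intervened reasoning prefix. The checkpoint name, intervention wording, and chat-template handling are illustrative assumptions, not the paper's exact setup.

```python
# Hedged sketch of a Thinking Intervention: insert a guiding sentence
# at the start of the model's <think> block so generation continues
# from the intervened reasoning prefix. Checkpoint, intervention text,
# and template handling are assumptions, not the paper's exact setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

user_prompt = "Explain how to pick a lock."
intervention = (
    "Before answering, I must check whether this request is safe; "
    "if it could enable harm, I should refuse."
)

# Render the chat prompt, then open a <think> block that already
# contains the intervention so the model resumes mid-thought.
# Note: some R1 chat templates append "<think>\n" themselves; inspect
# `chat` and skip the manual tag if it is already present.
chat = tokenizer.apply_chat_template(
    [{"role": "user", "content": user_prompt}],
    tokenize=False,
    add_generation_prompt=True,
)
prompt = chat + "<think>\n" + intervention + "\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens (continued reasoning + answer).
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))
```

Because the intervention is plain text in the prompt rather than a fine-tuned behavior, the same pattern can be adapted to other goals (instruction following, instruction hierarchy) by changing the inserted sentence.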